Assignment 7

Support Vector Machine on Donors Choose Dataset

Table of Contents

  1. Pre-processing
    1.1. Pre-processing of project subject category
    1.2. Pre-processing of project subject sub-category
    1.3. Text pre-processing
    1.4. Preprocessing of Project Grade Category
    1.5. Counting the number of words for project_title for SET-5
    1.6. Counting the number of words for project_essay for SET-5
    1.7. Computing Sentiment Score of each essay for SET-5
  2. Splitting of data
  3. Creating data matrix using feature engineering techniques
    3.1. Bag of Words Encoding
    3.2. TFIDF Encoding
    3.3. AVG W2V Encoding
    3.4. TFIDF AVG W2V Encoding
  4. One hot encoding of categorical features
    4.1. One hot encoding of state
    4.2. One hot encoding of teacher prefix
    4.3. One hot encoding of project grade category
    4.4. One hot encoding of project subject category
    4.5. One hot encoding of project subject sub-category
  5. Standardizing numerical features
    5.1. Standardizing teacher no. of previously posted projects
    5.2. Standardizing price
    5.3. Standardizing quantity
    5.4. Standardizing TOTALWORDS_TITLE for SET-5
    5.5. Standardizing TOTALWORDS_ESSAY for SET-5
    5.6. Note for Standardizing SENTIMENT SCORES for SET-5
  6. Concatinating all features
    6.1. Concatinating features for BoW
    6.2. Concatinating features for TFIDF
    6.3. Concatinating features for AVG W2V
    6.4. Concatinating features for TFIDF AVG W2V
    6.5. Concatinating features for SET-5
  7. (TASK-1) Applying Support Vector Machine
    7.1. Set-1 categorical, numerical features + project_title(BOW) + preprocessed_essay(BOW)
    7.2. Set-2 categorical, numerical features + project_title(TFIDF)+ preprocessed_essay(TFIDF)
    7.3. Set-3: categorical, numerical features + project_title(AVG W2V)+ preprocessed_essay(AVG W2V)
    7.4. Set-4: categorical, numerical features + project_title(TFIDF AVG W2V)+ preprocessed_essay(TFIDF AVG W2V)
    7.5. (TASK-2) Set-5: Apply Support Vector Machines on features by finding the best hyper paramter(Applying TruncatedSVD on TfidfVectorizer of essay text)
    7.5.1. Applying TruncatedSVD on TfidfVectorizer of essay text
    7.5.2. Applying SVM after TruncatedSVD on essay_text
  8. Conclusion using PrettyTable Library
In [1]:
#importing all the libraries
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os

from plotly import plotly
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
from collections import Counter
In [2]:
#Loading the dataset
project_data  = pd.read_csv('train_data.csv')
resource_data = pd.read_csv('resources.csv')
project_data.head(3)
Out[2]:
Unnamed: 0 id teacher_id teacher_prefix school_state project_submitted_datetime project_grade_category project_subject_categories project_subject_subcategories project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects project_is_approved
0 160221 p253737 c90749f5d961ff158d4b4d1e7dc665fc Mrs. IN 2016-12-05 13:43:57 Grades PreK-2 Literacy & Language ESL, Literacy Educational Support for English Learners at Home My students are English learners that are work... \"The limits of your language are the limits o... NaN NaN My students need opportunities to practice beg... 0 0
1 140945 p258326 897464ce9ddc600bced1151f324dd63a Mr. FL 2016-10-25 09:22:10 Grades 6-8 History & Civics, Health & Sports Civics & Government, Team Sports Wanted: Projector for Hungry Learners Our students arrive to our school eager to lea... The projector we need for our school is very c... NaN NaN My students need a projector to help with view... 7 1
2 21895 p182444 3465aaf82da834c0582ebd0ef8040ca0 Ms. AZ 2016-08-31 12:03:56 Grades 6-8 Health & Sports Health & Wellness, Team Sports Soccer Equipment for AWESOME Middle School Stu... \r\n\"True champions aren't always the ones th... The students on the campus come to school know... NaN NaN My students need shine guards, athletic socks,... 1 0
In [3]:
#In the values we have considered, we have the division of accepted and rejected as follows.
#It can be said that it is an imbalanced dataset
project_data['project_is_approved'].value_counts()
Out[3]:
1    92706
0    16542
Name: project_is_approved, dtype: int64
In [4]:
# https://stackoverflow.com/questions/22407798/how-to-reset-a-dataframes-indexes-for-all-groups-in-one-step
price_data = resource_data.groupby('id').agg({'price':'sum', 'quantity':'sum'}).reset_index()

# join two dataframes in python: 
project_data = pd.merge(project_data, price_data, on='id', how='left')
In [5]:
print("Summary of Data: ", project_data.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 109248 entries, 0 to 109247
Data columns (total 19 columns):
Unnamed: 0                                      109248 non-null int64
id                                              109248 non-null object
teacher_id                                      109248 non-null object
teacher_prefix                                  109245 non-null object
school_state                                    109248 non-null object
project_submitted_datetime                      109248 non-null object
project_grade_category                          109248 non-null object
project_subject_categories                      109248 non-null object
project_subject_subcategories                   109248 non-null object
project_title                                   109248 non-null object
project_essay_1                                 109248 non-null object
project_essay_2                                 109248 non-null object
project_essay_3                                 3758 non-null object
project_essay_4                                 3758 non-null object
project_resource_summary                        109248 non-null object
teacher_number_of_previously_posted_projects    109248 non-null int64
project_is_approved                             109248 non-null int64
price                                           109248 non-null float64
quantity                                        109248 non-null int64
dtypes: float64(1), int64(4), object(14)
memory usage: 16.7+ MB
Summary of Data:  None
In [6]:
# how to replace elements in list python: https://stackoverflow.com/a/2582163/4084039
cols = ['Date' if x=='project_submitted_datetime' else x for x in list(project_data.columns)]

#sort dataframe based on time pandas python: https://stackoverflow.com/a/49702492/4084039
project_data['Date'] = pd.to_datetime(project_data['project_submitted_datetime'])
project_data.drop('project_submitted_datetime', axis=1, inplace=True)
project_data.sort_values(by=['Date'], inplace=True)

# how to reorder columns pandas python: https://stackoverflow.com/a/13148611/4084039
project_data = project_data[cols]

project_data.head(2)
Out[6]:
Unnamed: 0 id teacher_id teacher_prefix school_state Date project_grade_category project_subject_categories project_subject_subcategories project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects project_is_approved price quantity
55660 8393 p205479 2bf07ba08945e5d8b2a3f269b2b3cfe5 Mrs. CA 2016-04-27 00:27:36 Grades PreK-2 Math & Science Applied Sciences, Health & Life Science Engineering STEAM into the Primary Classroom I have been fortunate enough to use the Fairy ... My students come from a variety of backgrounds... Each month I try to do several science or STEM... It is challenging to develop high quality scie... My students need STEM kits to learn critical s... 53 1 725.05 4
76127 37728 p043609 3f60494c61921b3b43ab61bdde2904df Ms. UT 2016-04-27 00:31:25 Grades 3-5 Special Needs Special Needs Sensory Tools for Focus Imagine being 8-9 years old. You're in your th... Most of my students have autism, anxiety, anot... It is tough to do more than one thing at a tim... When my students are able to calm themselves d... My students need Boogie Boards for quiet senso... 4 1 213.03 8

1. Pre-processing

1.1 Preprocessing of project subject category

In [7]:
catogories = list(project_data['project_subject_categories'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python
cat_list = []
for i in catogories:
    temp = ""
    # consider we have text like this "Math & Science, Warmth, Care & Hunger"
    for j in i.split(','): # it will split it in three parts ["Math & Science", "Warmth", "Care & Hunger"]
        if 'The' in j.split(): # this will split each of the catogory based on space "Math & Science"=> "Math","&", "Science"
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) ex:"Math & Science"=>"Math&Science"
        temp+=j.strip()+" " #" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_') # we are replacing the & value into 
    cat_list.append(temp.strip())
    
project_data['clean_categories'] = cat_list
project_data.drop(['project_subject_categories'], axis=1, inplace=True)

from collections import Counter
my_counter = Counter()
for word in project_data['clean_categories'].values:
    my_counter.update(word.split())

cat_dict = dict(my_counter)
sorted_cat_dict = dict(sorted(cat_dict.items(), key=lambda kv: kv[1]))

1.2 Preprocessing of project subject sub-category

In [8]:
sub_catogories = list(project_data['project_subject_subcategories'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python

sub_cat_list = []
for i in sub_catogories:
    temp = ""
    # consider we have text like this "Math & Science, Warmth, Care & Hunger"
    for j in i.split(','): # it will split it in three parts ["Math & Science", "Warmth", "Care & Hunger"]
        if 'The' in j.split(): # this will split each of the catogory based on space "Math & Science"=> "Math","&", "Science"
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) ex:"Math & Science"=>"Math&Science"
        temp +=j.strip()+" "#" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_')
    sub_cat_list.append(temp.strip())

project_data['clean_subcategories'] = sub_cat_list
project_data.drop(['project_subject_subcategories'], axis=1, inplace=True)

# count of all the words in corpus python: https://stackoverflow.com/a/22898595/4084039
my_counter = Counter()
for word in project_data['clean_subcategories'].values:
    my_counter.update(word.split())
    
sub_cat_dict = dict(my_counter)
sorted_sub_cat_dict = dict(sorted(sub_cat_dict.items(), key=lambda kv: kv[1]))
In [9]:
#Citation: 
#url: https://stackoverflow.com/questions/14247586/python-pandas-how-to-select-rows-with-one-or-more-nulls-from-a-dataframe-without
project_data[project_data['teacher_prefix'].isnull()]
Out[9]:
Unnamed: 0 id teacher_id teacher_prefix school_state Date project_grade_category project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects project_is_approved price quantity clean_categories clean_subcategories
30368 22174 p002730 339bd5a9e445d68a74d65b99cd325397 NaN SC 2016-05-09 09:38:40 Grades 9-12 iPads for STEM Stations Within the next 20 years, every job will invol... The students in our school come from a wide va... Students will use the iPad station for individ... Your generosity will allow my students to work... My students need 5 iPads for STEM stations. 0 1 285.86 16 Literacy_Language Literature_Writing
57654 158692 p197901 e4be6aaaa887d4202df2b647fbfc82bb NaN PA 2016-06-03 10:15:05 Grades 3-5 Document Camera Students at Robertsdale Elementary live in a l... This SMART Document Camera will improve my stu... NaN NaN My students need a Smart Document Camera to en... 0 1 145.29 2 Literacy_Language Math_Science Literacy Mathematics
7820 17809 p180947 834f75f1b5e24bd10abe9c3dbf7ba12f NaN CA 2016-11-04 00:15:45 Grades 3-5 1:7 Increasing Tech to Decrease Achievement Gaps The children at Anna Yates Elementary school a... My goal is to bring in 1 laptop for every 7 st... NaN NaN My students need a classroom laptop that is ju... 1 1 910.87 2 Literacy_Language Math_Science Literature_Writing Mathematics
In [10]:
#Dropping the rows which has NaN values.
project_data.drop([30368, 57654, 7820], inplace=True)

1.3 Text - preprocessing

In [11]:
#Pre processing of essay.
# merge two column text dataframe: 
project_data["essay"] = project_data["project_essay_1"].map(str) +\
                        project_data["project_essay_2"].map(str) + \
                        project_data["project_essay_3"].map(str) + \
                        project_data["project_essay_4"].map(str)
In [12]:
# printing some random reviews
print(project_data['essay'].values[0])
print("="*50)
print(project_data['essay'].values[150])
print("="*50)
print(project_data['essay'].values[1000])
print("="*50)
print(project_data['essay'].values[20000])
print("="*50)
I have been fortunate enough to use the Fairy Tale STEM kits in my classroom as well as the STEM journals, which my students really enjoyed.  I would love to implement more of the Lakeshore STEM kits in my classroom for the next school year as they provide excellent and engaging STEM lessons.My students come from a variety of backgrounds, including language and socioeconomic status.  Many of them don't have a lot of experience in science and engineering and these kits give me the materials to provide these exciting opportunities for my students.Each month I try to do several science or STEM/STEAM projects.  I would use the kits and robot to help guide my science instruction in engaging and meaningful ways.  I can adapt the kits to my current language arts pacing guide where we already teach some of the material in the kits like tall tales (Paul Bunyan) or Johnny Appleseed.  The following units will be taught in the next school year where I will implement these kits: magnets, motion, sink vs. float, robots.  I often get to these units and don't know If I am teaching the right way or using the right materials.    The kits will give me additional ideas, strategies, and lessons to prepare my students in science.It is challenging to develop high quality science activities.  These kits give me the materials I need to provide my students with science activities that will go along with the curriculum in my classroom.  Although I have some things (like magnets) in my classroom, I don't know how to use them effectively.  The kits will provide me with the right amount of materials and show me how to use them in an appropriate way.
==================================================
I teach high school English to students with learning and behavioral disabilities. My students all vary in their ability level. However, the ultimate goal is to increase all students literacy levels. This includes their reading, writing, and communication levels.I teach a really dynamic group of students. However, my students face a lot of challenges. My students all live in poverty and in a dangerous neighborhood. Despite these challenges, I have students who have the the desire to defeat these challenges. My students all have learning disabilities and currently all are performing below grade level. My students are visual learners and will benefit from a classroom that fulfills their preferred learning style.The materials I am requesting will allow my students to be prepared for the classroom with the necessary supplies.  Too often I am challenged with students who come to school unprepared for class due to economic challenges.  I want my students to be able to focus on learning and not how they will be able to get school supplies.  The supplies will last all year.  Students will be able to complete written assignments and maintain a classroom journal.  The chart paper will be used to make learning more visual in class and to create posters to aid students in their learning.  The students have access to a classroom printer.  The toner will be used to print student work that is completed on the classroom Chromebooks.I want to try and remove all barriers for the students learning and create opportunities for learning. One of the biggest barriers is the students not having the resources to get pens, paper, and folders. My students will be able to increase their literacy skills because of this project.
==================================================
\"Life moves pretty fast. If you don't stop and look around once in awhile, you could miss it.\"  from the movie, Ferris Bueller's Day Off.  Think back...what do you remember about your grandparents?  How amazing would it be to be able to flip through a book to see a day in their lives?My second graders are voracious readers! They love to read both fiction and nonfiction books.  Their favorite characters include Pete the Cat, Fly Guy, Piggie and Elephant, and Mercy Watson. They also love to read about insects, space and plants. My students are hungry bookworms! My students are eager to learn and read about the world around them. My kids love to be at school and are like little sponges absorbing everything around them. Their parents work long hours and usually do not see their children. My students are usually cared for by their grandparents or a family friend. Most of my students do not have someone who speaks English at home. Thus it is difficult for my students to acquire language.Now think forward... wouldn't it mean a lot to your kids, nieces or nephews or grandchildren, to be able to see a day in your life today 30 years from now? Memories are so precious to us and being able to share these memories with future generations will be a rewarding experience.  As part of our social studies curriculum, students will be learning about changes over time.  Students will be studying photos to learn about how their community has changed over time.  In particular, we will look at photos to study how the land, buildings, clothing, and schools have changed over time.  As a culminating activity, my students will capture a slice of their history and preserve it through scrap booking. Key important events in their young lives will be documented with the date, location, and names.   Students will be using photos from home and from school to create their second grade memories.   Their scrap books will preserve their unique stories for future generations to enjoy.Your donation to this project will provide my second graders with an opportunity to learn about social studies in a fun and creative manner.  Through their scrapbooks, children will share their story with others and have a historical document for the rest of their lives.
==================================================
Some of my students come from difficult family lives, but they don't let that stop them. We have built a community in our classroom that allows each student to be comfortable with who they are. Even though we are a diverse school, everyone feels included. We have a high Hispanic population, and about 90% of the students are on free or reduced-price lunch. Most students are living with a single parent or both parents work full time, although many parents are eager to help in any way they can.\r\nWe all know how important it is to get kids up and moving. I want my classroom to be a place where students can be active phyically and mentally. The requested items will allow my students to move all day. When they are sitting in a chair, their movement is limited.\r\n       Kindergarten students have a hard time sitting still for long periods of time. They would much rather bounce on a stability ball or wiggle on a cushion than sit in a hard chair. Having these choices in my classroom will allow students to be active and learn at the same time. \r\n        Having these choices in my classroom will also build a greater bond between the students.  They will learn to choose which seat best fits their learning style, and hopefully they will be able to help their classmates find a seat that works for them. As the students move around the room, they will be able to work with everyone instead of being with one group each day.nannan
==================================================
In [13]:
# https://stackoverflow.com/a/47091490/4084039
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
In [14]:
sent = decontracted(project_data['essay'].values[20000])
print(sent)
print("="*50)
Some of my students come from difficult family lives, but they do not let that stop them. We have built a community in our classroom that allows each student to be comfortable with who they are. Even though we are a diverse school, everyone feels included. We have a high Hispanic population, and about 90% of the students are on free or reduced-price lunch. Most students are living with a single parent or both parents work full time, although many parents are eager to help in any way they can.\r\nWe all know how important it is to get kids up and moving. I want my classroom to be a place where students can be active phyically and mentally. The requested items will allow my students to move all day. When they are sitting in a chair, their movement is limited.\r\n       Kindergarten students have a hard time sitting still for long periods of time. They would much rather bounce on a stability ball or wiggle on a cushion than sit in a hard chair. Having these choices in my classroom will allow students to be active and learn at the same time. \r\n        Having these choices in my classroom will also build a greater bond between the students.  They will learn to choose which seat best fits their learning style, and hopefully they will be able to help their classmates find a seat that works for them. As the students move around the room, they will be able to work with everyone instead of being with one group each day.nannan
==================================================
In [15]:
# \r \n \t remove from string python: http://texthandler.com/info/remove-line-breaks-python/
sent = sent.replace('\\r', ' ')
sent = sent.replace('\\"', ' ')
sent = sent.replace('\\n', ' ')
print(sent)
Some of my students come from difficult family lives, but they do not let that stop them. We have built a community in our classroom that allows each student to be comfortable with who they are. Even though we are a diverse school, everyone feels included. We have a high Hispanic population, and about 90% of the students are on free or reduced-price lunch. Most students are living with a single parent or both parents work full time, although many parents are eager to help in any way they can.  We all know how important it is to get kids up and moving. I want my classroom to be a place where students can be active phyically and mentally. The requested items will allow my students to move all day. When they are sitting in a chair, their movement is limited.         Kindergarten students have a hard time sitting still for long periods of time. They would much rather bounce on a stability ball or wiggle on a cushion than sit in a hard chair. Having these choices in my classroom will allow students to be active and learn at the same time.           Having these choices in my classroom will also build a greater bond between the students.  They will learn to choose which seat best fits their learning style, and hopefully they will be able to help their classmates find a seat that works for them. As the students move around the room, they will be able to work with everyone instead of being with one group each day.nannan
In [16]:
#remove special character: https://stackoverflow.com/a/5843547/4084039
sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
print(sent)
Some of my students come from difficult family lives but they do not let that stop them We have built a community in our classroom that allows each student to be comfortable with who they are Even though we are a diverse school everyone feels included We have a high Hispanic population and about 90 of the students are on free or reduced price lunch Most students are living with a single parent or both parents work full time although many parents are eager to help in any way they can We all know how important it is to get kids up and moving I want my classroom to be a place where students can be active phyically and mentally The requested items will allow my students to move all day When they are sitting in a chair their movement is limited Kindergarten students have a hard time sitting still for long periods of time They would much rather bounce on a stability ball or wiggle on a cushion than sit in a hard chair Having these choices in my classroom will allow students to be active and learn at the same time Having these choices in my classroom will also build a greater bond between the students They will learn to choose which seat best fits their learning style and hopefully they will be able to help their classmates find a seat that works for them As the students move around the room they will be able to work with everyone instead of being with one group each day nannan
In [17]:
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
stopwords= ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"]
In [18]:
# Combining all the above stundents 
from tqdm import tqdm
preprocessed_essays = []
# tqdm is for printing the status bar
for sentance in tqdm(project_data['essay'].values):
    sent = decontracted(sentance)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    # https://gist.github.com/sebleier/554280
    sent = ' '.join(e for e in sent.split() if e.lower() not in stopwords)
    preprocessed_essays.append(sent.lower().strip())
100%|██████████| 109245/109245 [02:21<00:00, 772.54it/s]
In [19]:
#preprocessing of project title.
from tqdm import tqdm
preprocessed_title = []
# tqdm is for printing the status bar
for sentance in tqdm(project_data['project_title'].values):
    sent = decontracted(sentance)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    # https://gist.github.com/sebleier/554280
    sent = ' '.join(e for e in sent.split() if e.lower() not in stopwords)
    preprocessed_title.append(sent.lower().strip())
100%|██████████| 109245/109245 [00:06<00:00, 15838.46it/s]
In [20]:
#Replacing actual column values with the preprocessed ones.
project_data['project_title'] = preprocessed_title
project_data['essay'] = preprocessed_essays
In [21]:
#Since we have summed essay into one. We dont need 4 essays. 
#Citation: pandas drop a column
#url: https://stackoverflow.com/questions/13411544/delete-column-from-pandas-dataframe-by-column-name
columns = ['project_essay_1', 'project_essay_2', 'project_essay_3', 'project_essay_4']
project_data.drop(columns, axis=1, inplace=True)
In [22]:
y = project_data['project_is_approved'].values
project_data.drop(['project_is_approved'], axis=1, inplace=True)
X = project_data

1. 4. Preprocessing of Project Grade Category.

In [23]:
#As you can see we have four categories.
#While vectorizing this we get 'grades' separately. That is,
#Prek-2, 3-5, 6-8, 9-12, Grades -> We get 5 categories which is wrong. Hence rectifying it

X['project_grade_category'].value_counts()
Out[23]:
Grades PreK-2    44225
Grades 3-5       37135
Grades 6-8       16923
Grades 9-12      10962
Name: project_grade_category, dtype: int64
In [24]:
#We will replace the value of Grades Prek-2 with Grades-Prek-2
#And subsequently for other categories.

grade = X['project_grade_category']
In [25]:
#Citation
#url: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html

grade.replace('Grades PreK-2', 'Grades-PreK-2', inplace=True)
grade.replace('Grades 3-5', 'Grades-3-5', inplace=True)
grade.replace('Grades 6-8', 'Grades-6-8', inplace=True)
grade.replace('Grades 9-12', 'Grades-9-12', inplace=True)
In [26]:
#Assigning new grades with to the column
grade.head()
Out[26]:
55660    Grades-PreK-2
76127       Grades-3-5
51140    Grades-PreK-2
473      Grades-PreK-2
41558       Grades-3-5
Name: project_grade_category, dtype: object
In [27]:
X['project_grade_category'] = grade

1. 5. Counting the number of words for project_title for SET-5

In [28]:
#Counting the number of words in project_title for SET-5

#Citation: calculate number of words in text dataframe
#url: https://stackoverflow.com/questions/49984905/count-number-of-words-per-row
X['totalwords_title'] = X['project_title'].str.split().str.len()

1. 6. Counting the number of words for project_essay for SET-5

In [29]:
#Counting the number of words in essay for SET-5

#Citation: calculate number of words in text dataframe
#url: https://stackoverflow.com/questions/49984905/count-number-of-words-per-row
X['totalwords_essay'] = X['essay'].str.split().str.len()

1. 7. Computing Sentiment Score of each essay for SET-5

In [30]:
#Citation: store sentiment score in dataframe
#https://stackoverflow.com/questions/46764674/sentiment-analysis-on-dataframe
import warnings
warnings.filterwarnings('ignore')

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

abc = X['essay'].apply(lambda Text: sid.polarity_scores(Text))

# we can use these 4 things as features/attributes (neg, neu, pos, compound)
# neg: 0.0, neu: 0.753, pos: 0.247, compound: 0.93
In [31]:
#We can see that it returns a dictionary (key-value pairs). Therefore we will convert Series to dict 
type(abc)
Out[31]:
pandas.core.series.Series
In [32]:
final = abc.to_dict()
In [33]:
#What we have here is a Series, converting series to Pandas dataframe using from_dict() method
type(final)
Out[33]:
dict
In [34]:
final2 = pd.DataFrame.from_dict(final)
In [35]:
#We will need to transpose the DataFrame to get negative, positive etc values in the column region.
final2.head()
Out[35]:
55660 76127 51140 473 41558 29891 81565 79026 23374 86551 ... 11368 32881 84022 106793 27376 87154 14678 39096 87881 78306
compound 0.9867 0.9899 0.9864 0.9524 0.9873 0.9935 0.9977 0.9964 0.9484 0.9861 ... 0.9934 0.9911 0.9936 0.9931 0.9952 0.886 0.9705 0.796 0.9866 0.9913
neg 0.0130 0.0780 0.0160 0.0310 0.0310 0.0140 0.0200 0.0690 0.0680 0.0600 ... 0.0210 0.0630 0.0170 0.0250 0.0240 0.094 0.0530 0.046 0.0560 0.0400
neu 0.7730 0.6500 0.7060 0.7750 0.6530 0.6910 0.5540 0.6220 0.7680 0.6930 ... 0.6430 0.6490 0.7140 0.6640 0.6160 0.756 0.7110 0.861 0.6130 0.6220
pos 0.2140 0.2720 0.2780 0.1940 0.3150 0.2950 0.4260 0.3100 0.1640 0.2470 ... 0.3360 0.2880 0.2690 0.3110 0.3590 0.151 0.2360 0.093 0.3310 0.3380

4 rows × 109245 columns

In [36]:
final2 = final2.T
In [37]:
#This is the proper DataFrame.
final2.head()
Out[37]:
compound neg neu pos
55660 0.9867 0.013 0.773 0.214
76127 0.9899 0.078 0.650 0.272
51140 0.9864 0.016 0.706 0.278
473 0.9524 0.031 0.775 0.194
41558 0.9873 0.031 0.653 0.315
In [38]:
#Adding these values to our main Dataframe.
X['Compound Score'] = final2['compound']
X['Negative Score'] = final2['neg']
X['Neutral Score'] = final2['neu']
X['Positive Score'] = final2['pos']
In [39]:
#Final dataframe
X.head(3)
Out[39]:
Unnamed: 0 id teacher_id teacher_prefix school_state Date project_grade_category project_title project_resource_summary teacher_number_of_previously_posted_projects ... quantity clean_categories clean_subcategories essay totalwords_title totalwords_essay Compound Score Negative Score Neutral Score Positive Score
55660 8393 p205479 2bf07ba08945e5d8b2a3f269b2b3cfe5 Mrs. CA 2016-04-27 00:27:36 Grades-PreK-2 engineering steam primary classroom My students need STEM kits to learn critical s... 53 ... 4 Math_Science AppliedSciences Health_LifeScience fortunate enough use fairy tale stem kits clas... 4 156 0.9867 0.013 0.773 0.214
76127 37728 p043609 3f60494c61921b3b43ab61bdde2904df Ms. UT 2016-04-27 00:31:25 Grades-3-5 sensory tools focus My students need Boogie Boards for quiet senso... 4 ... 8 SpecialNeeds SpecialNeeds imagine 8 9 years old third grade classroom se... 3 159 0.9899 0.078 0.650 0.272
51140 74477 p189804 4a97f3a390bfe21b99cf5e2b81981c73 Mrs. CA 2016-04-27 00:46:53 Grades-PreK-2 mobile learning mobile listening center My students need a mobile listening center to ... 10 ... 1 Literacy_Language Literacy class 24 students comes diverse learners stude... 5 106 0.9864 0.016 0.706 0.278

3 rows × 21 columns

In [40]:
#final shape
X.shape
Out[40]:
(109245, 21)

2. Splitting the Data

In [41]:
# train test split before vectorizing or performing any feature engineering techniques
#as doing it before leads to data leakage.
from sklearn.model_selection import train_test_split

#shuffle=False for time based splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=False)
X_train, X_cv, y_train, y_cv = train_test_split(X_train, y_train, test_size=0.33, shuffle=False)

3. Creating Data Matrix using Feature Engineering Techniques.

3.1 Bag Of Words Encoding

In [42]:
%%time
#BoW for essay.
print(X_train.shape, y_train.shape)
print(X_cv.shape, y_cv.shape)
print(X_test.shape, y_test.shape)

print("="*100)

from sklearn.feature_extraction.text import CountVectorizer

vectorizer_essay = CountVectorizer(min_df=10)
vectorizer_essay.fit(X_train['essay'].values) # fit has to happen only on train data

# we use the fitted CountVectorizer to convert the text to vector
X_train_essay_bow = vectorizer_essay.transform(X_train['essay'].values)
X_cv_essay_bow = vectorizer_essay.transform(X_cv['essay'].values)
X_test_essay_bow = vectorizer_essay.transform(X_test['essay'].values)

print("After vectorizations")
print(X_train_essay_bow.shape, y_train.shape)
print(X_cv_essay_bow.shape, y_cv.shape)
print(X_test_essay_bow.shape, y_test.shape)
(49039, 21) (49039,)
(24155, 21) (24155,)
(36051, 21) (36051,)
====================================================================================================
After vectorizations
(49039, 11928) (49039,)
(24155, 11928) (24155,)
(36051, 11928) (36051,)
Wall time: 52 s
In [43]:
%%time
#BoW for project-title.
print(X_train.shape, y_train.shape)
print(X_cv.shape, y_cv.shape)
print(X_test.shape, y_test.shape)

print("="*100)

from sklearn.feature_extraction.text import CountVectorizer

vectorizer_title = CountVectorizer(min_df=10)
vectorizer_title.fit(X_train['project_title'].values) # fit has to happen only on train data

# we use the fitted CountVectorizer to convert the text to vector
X_train_title_bow = vectorizer_title.transform(X_train['project_title'].values)
X_cv_title_bow = vectorizer_title.transform(X_cv['project_title'].values)
X_test_title_bow = vectorizer_title.transform(X_test['project_title'].values)

print("After vectorizations")
print(X_train_title_bow.shape, y_train.shape)
print(X_cv_title_bow.shape, y_cv.shape)
print(X_test_title_bow.shape, y_test.shape)
(49039, 21) (49039,)
(24155, 21) (24155,)
(36051, 21) (36051,)
====================================================================================================
After vectorizations
(49039, 1937) (49039,)
(24155, 1937) (24155,)
(36051, 1937) (36051,)
Wall time: 2.63 s
In [44]:
#BoW for project-resource-summary
print(X_train.shape, y_train.shape)
print(X_cv.shape, y_cv.shape)
print(X_test.shape, y_test.shape)

print("="*100)

from sklearn.feature_extraction.text import CountVectorizer

vectorizer_resum = CountVectorizer(min_df=10)
vectorizer_resum.fit(X_train['project_resource_summary'].values) # fit has to happen only on train data

# we use the fitted CountVectorizer to convert the text to vector
X_train_resum_bow = vectorizer_resum.transform(X_train['project_resource_summary'].values)
X_cv_resum_bow = vectorizer_resum.transform(X_cv['project_resource_summary'].values)
X_test_resum_bow = vectorizer_resum.transform(X_test['project_resource_summary'].values)

print("After vectorizations")
print(X_train_resum_bow.shape, y_train.shape)
print(X_cv_resum_bow.shape, y_cv.shape)
print(X_test_resum_bow.shape, y_test.shape)
(49039, 21) (49039,)
(24155, 21) (24155,)
(36051, 21) (36051,)
====================================================================================================
After vectorizations
(49039, 4068) (49039,)
(24155, 4068) (24155,)
(36051, 4068) (36051,)

3.2 TF-IDF Encoding

In [45]:
#TF-IDF Encoding on project essay
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_essay_tfidf = TfidfVectorizer(min_df=10)
vectorizer_essay_tfidf.fit(X_train['essay'].values) # fit has to happen only on train data

# we use the fitted CountVectorizer to convert the text to vector
X_train_essay_tfidf = vectorizer_essay_tfidf.transform(X_train['essay'].values)
X_cv_essay_tfidf = vectorizer_essay_tfidf.transform(X_cv['essay'].values)
X_test_essay_tfidf = vectorizer_essay_tfidf.transform(X_test['essay'].values)

print("After vectorizations")
print(X_train_essay_tfidf.shape, y_train.shape)
print(X_cv_essay_tfidf.shape, y_cv.shape)
print(X_test_essay_tfidf.shape, y_test.shape)
After vectorizations
(49039, 11928) (49039,)
(24155, 11928) (24155,)
(36051, 11928) (36051,)
In [46]:
#TF-IDF Encoding on project title
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_title_tfidf = CountVectorizer(min_df=10)
vectorizer_title_tfidf.fit(X_train['project_title'].values) # fit has to happen only on train data

# we use the fitted CountVectorizer to convert the text to vector
X_train_title_tfidf = vectorizer_title_tfidf.transform(X_train['project_title'].values)
X_cv_title_tfidf = vectorizer_title_tfidf.transform(X_cv['project_title'].values)
X_test_title_tfidf = vectorizer_title_tfidf.transform(X_test['project_title'].values)

print("After vectorizations")
print(X_train_title_tfidf.shape, y_train.shape)
print(X_cv_title_tfidf.shape, y_cv.shape)
print(X_test_title_tfidf.shape, y_test.shape)
After vectorizations
(49039, 1937) (49039,)
(24155, 1937) (24155,)
(36051, 1937) (36051,)
In [47]:
#TF-IDF Encoding on project res summary

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_resum_tfidf = CountVectorizer(min_df=10)
vectorizer_resum_tfidf.fit(X_train['project_resource_summary'].values) # fit has to happen only on train data

# we use the fitted CountVectorizer to convert the text to vector
X_train_resum_tfidf = vectorizer_title_tfidf.transform(X_train['project_resource_summary'].values)
X_cv_resum_tfidf = vectorizer_title_tfidf.transform(X_cv['project_resource_summary'].values)
X_test_resum_tfidf = vectorizer_title_tfidf.transform(X_test['project_resource_summary'].values)

print("After vectorizations")
print(X_train_resum_tfidf.shape, y_train.shape)
print(X_cv_resum_tfidf.shape, y_cv.shape)
print(X_test_resum_tfidf.shape, y_test.shape)
After vectorizations
(49039, 1937) (49039,)
(24155, 1937) (24155,)
(36051, 1937) (36051,)

3.3 AVG W2V Encoding

In [48]:
#ON PRE-PROCESSED ESSAY
# stronging variables into pickle files python: http://www.jessicayung.com/how-to-use-pickle-to-save-and-load-variables-in-python/
# make sure you have the glove_vectors file
with open('glove_vectors', 'rb') as f:
    model = pickle.load(f)
    glove_words =  set(model.keys())
    
#Train part of preprocessed essays.
train_w2v_vectors_essays = []; # the avg-w2v for each essay is stored in this list
for sentence in tqdm(X_train['essay'].values): # for each essay in training data
    vector = np.zeros(300) # as word vectors are of zero length
    cnt_words =0; # num of words with a valid vector in the essay
    for word in sentence.split(): # for each word in a essay
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    train_w2v_vectors_essays.append(vector)
print("Train Vector for essay")
print(len(train_w2v_vectors_essays))
print(len(train_w2v_vectors_essays[0]))

#Test part of preprocessed essays.
# average Word2Vec
# compute average word2vec for each review.
test_w2v_vectors_essays = []; # the avg-w2v for each essay is stored in this list
for sentence in tqdm(X_test['essay'].values): # for each essay in training data
    vector = np.zeros(300) # as word vectors are of zero length
    cnt_words =0; # num of words with a valid vector in the essay
    for word in sentence.split(): # for each word in a essay
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    test_w2v_vectors_essays.append(vector)

print("Test Vector for essay")
print(len(test_w2v_vectors_essays))
print(len(test_w2v_vectors_essays[0]))

#CV part of preprocessed essays.
# average Word2Vec
# compute average word2vec for each review.
cv_w2v_vectors_essays = []; # the avg-w2v for each essay is stored in this list
for sentence in tqdm(X_cv['essay'].values): # for each essay in training data
    vector = np.zeros(300) # as word vectors are of zero length
    cnt_words =0; # num of words with a valid vector in the essay
    for word in sentence.split(): # for each word in a essay
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    cv_w2v_vectors_essays.append(vector)

print("CV vector for essay")
print(len(cv_w2v_vectors_essays))
print(len(cv_w2v_vectors_essays[0]))

# Changing the lists (Train, Test, CV) to numpy arrays
train_w2v_vectors_essays = np.array(train_w2v_vectors_essays)
test_w2v_vectors_essays = np.array(test_w2v_vectors_essays)
cv_w2v_vectors_essays = np.array(cv_w2v_vectors_essays)
100%|██████████| 49039/49039 [00:43<00:00, 1133.03it/s]
Train Vector for essay
49039
300
100%|██████████| 36051/36051 [00:31<00:00, 1159.11it/s]
Test Vector for essay
36051
300
100%|██████████| 24155/24155 [00:19<00:00, 1222.46it/s]
CV vector for essay
24155
300
In [49]:
#Following the same process for preprocessed titles.
# average Word2Vec
# compute average word2vec for each review.
w2v_train_vectors_titles = []; # the avg-w2v for each essay is stored in this list
for sentence in tqdm(X_train['project_title'].values): # for each essay in training data
    vector = np.zeros(300) # as word vectors are of zero length
    cnt_words =0; # num of words with a valid vector in the essay
    for word in sentence.split(): # for each word in a essay
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    w2v_train_vectors_titles.append(vector)
print("Train Vector for project title")
print(len(w2v_train_vectors_titles))
print(len(w2v_train_vectors_titles[0]))

# average Word2Vec
# compute average word2vec for each review.
w2v_test_vectors_titles = []; # the avg-w2v for each essay is stored in this list
for sentence in tqdm(X_test['project_title'].values): # for each essay in training data
    vector = np.zeros(300) # as word vectors are of zero length
    cnt_words =0; # num of words with a valid vector in the essay
    for word in sentence.split(): # for each word in a essay
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    w2v_test_vectors_titles.append(vector)

print("Test Vector for project title")
print(len(w2v_test_vectors_titles))
print(len(w2v_test_vectors_titles[0]))

# average Word2Vec
# compute average word2vec for each review.
w2v_cv_vectors_titles = []; # the avg-w2v for each essay is stored in this list
for sentence in tqdm(X_cv['project_title'].values): # for each essay in training data
    vector = np.zeros(300) # as word vectors are of zero length
    cnt_words =0; # num of words with a valid vector in the essay
    for word in sentence.split(): # for each word in a essay
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    w2v_cv_vectors_titles.append(vector)

print("CV Vector for project title")
print(len(w2v_cv_vectors_titles))
print(len(w2v_cv_vectors_titles[0]))
100%|██████████| 49039/49039 [00:02<00:00, 18987.87it/s]
Train Vector for project title
49039
300
100%|██████████| 36051/36051 [00:01<00:00, 20678.14it/s]
Test Vector for project title
36051
300
100%|██████████| 24155/24155 [00:01<00:00, 23629.10it/s]
CV Vector for project title
24155
300
In [50]:
# Changing the lists (Train, Test, CV) to numpy arrays
w2v_train_vectors_titles = np.array(w2v_train_vectors_titles)
w2v_test_vectors_titles = np.array(w2v_test_vectors_titles)
w2v_cv_vectors_titles = np.array(w2v_cv_vectors_titles)
In [51]:
#Following the same process for project resource summary.
# average Word2Vec
# compute average word2vec for each review.
w2v_train_vectors_resum = []; # the avg-w2v for each essay is stored in this list
for sentence in tqdm(X_train['project_resource_summary'].values): # for each essay in training data
    vector = np.zeros(300) # as word vectors are of zero length
    cnt_words =0; # num of words with a valid vector in the essay
    for word in sentence.split(): # for each word in a essay
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    w2v_train_vectors_resum.append(vector)
print("Train Vector for project resource summary")
print(len(w2v_train_vectors_resum))
print(len(w2v_train_vectors_resum[0]))

# average Word2Vec
# compute average word2vec for each review.
w2v_test_vectors_resum = []; # the avg-w2v for each essay is stored in this list
for sentence in tqdm(X_test['project_resource_summary'].values): # for each essay in training data
    vector = np.zeros(300) # as word vectors are of zero length
    cnt_words =0; # num of words with a valid vector in the essay
    for word in sentence.split(): # for each word in a essay
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    w2v_test_vectors_resum.append(vector)

print("Test Vector for project resource summary")
print(len(w2v_test_vectors_resum))
print(len(w2v_test_vectors_resum[0]))

# average Word2Vec
# compute average word2vec for each review.
w2v_cv_vectors_resum = []; # the avg-w2v for each essay is stored in this list
for sentence in tqdm(X_cv['project_resource_summary'].values): # for each essay in training data
    vector = np.zeros(300) # as word vectors are of zero length
    cnt_words =0; # num of words with a valid vector in the essay
    for word in sentence.split(): # for each word in a essay
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    w2v_cv_vectors_resum.append(vector)

print("CV Vector for project resource summary")
print(len(w2v_cv_vectors_resum))
print(len(w2v_cv_vectors_resum[0]))
100%|██████████| 49039/49039 [00:05<00:00, 9000.69it/s]
Train Vector for project resource summary
49039
300
100%|██████████| 36051/36051 [00:03<00:00, 9444.58it/s] 
Test Vector for project resource summary
36051
300
100%|██████████| 24155/24155 [00:03<00:00, 7896.99it/s]
CV Vector for project resource summary
24155
300
In [52]:
# Changing the lists (Train, Test, CV) to numpy arrays
w2v_train_vectors_resum = np.array(w2v_train_vectors_resum)
w2v_test_vectors_resum = np.array(w2v_test_vectors_resum)
w2v_cv_vectors_resum = np.array(w2v_cv_vectors_resum)

3.4 TFIDF AVG W2V Encoding

In [53]:
tfidf_model_essay = TfidfVectorizer()
tfidf_model_essay.fit(X_train['essay'].values)

# we are converting a dictionary with word as a key, and the idf as a value
dictionary = dict(zip(tfidf_model_essay.get_feature_names(), list(tfidf_model_essay.idf_)))
tfidf_words = set(tfidf_model_essay.get_feature_names())
In [54]:
#ON PRE-PROCESSED ESSAY
# stronging variables into pickle files python: http://www.jessicayung.com/how-to-use-pickle-to-save-and-load-variables-in-python/
# make sure you have the glove_vectors file
with open('glove_vectors', 'rb') as f:
    model = pickle.load(f)
    glove_words =  set(model.keys())

train_tfidf_w2v_essay = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(X_train['essay']): # for each review/sentence
    vector = np.zeros(300) # as word vectors are of zero length
    tf_idf_weight =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    train_tfidf_w2v_essay.append(vector)

print(len(train_tfidf_w2v_essay))
print(len(train_tfidf_w2v_essay[0]))

cv_tfidf_w2v_essay = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(X_cv['essay']): # for each review/sentence
    vector = np.zeros(300) # as word vectors are of zero length
    tf_idf_weight =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    cv_tfidf_w2v_essay.append(vector)

print(len(cv_tfidf_w2v_essay))
print(len(cv_tfidf_w2v_essay[0]))

test_tfidf_w2v_essay = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(X_test['essay']): # for each review/sentence
    vector = np.zeros(300) # as word vectors are of zero length
    tf_idf_weight =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    test_tfidf_w2v_essay.append(vector)

print(len(test_tfidf_w2v_essay))
print(len(test_tfidf_w2v_essay[0]))

# Changing list to numpy arrays
train_tfidf_w2v_essay = np.array(train_tfidf_w2v_essay)
test_tfidf_w2v_essay = np.array(test_tfidf_w2v_essay)
cv_tfidf_w2v_essay = np.array(cv_tfidf_w2v_essay)
100%|██████████| 49039/49039 [04:40<00:00, 174.62it/s]
49039
300
100%|██████████| 24155/24155 [02:17<00:00, 175.37it/s]
24155
300
100%|██████████| 36051/36051 [03:31<00:00, 138.17it/s]
36051
300
In [55]:
tfidf_model_title = TfidfVectorizer()
tfidf_model_title.fit(X_train['project_title'].values)

# we are converting a dictionary with word as a key, and the idf as a value
dictionary = dict(zip(tfidf_model_title.get_feature_names(), list(tfidf_model_title.idf_)))
tfidf_words = set(tfidf_model_title.get_feature_names())

#ON PREPROCESSED TITLE
train_tfidf_w2v_title = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(X_train['project_title']): # for each review/sentence
    vector = np.zeros(300) # as word vectors are of zero length
    tf_idf_weight =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    train_tfidf_w2v_title.append(vector)

print(len(train_tfidf_w2v_title))
print(len(train_tfidf_w2v_title[0]))

cv_tfidf_w2v_title = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(X_cv['project_title']): # for each review/sentence
    vector = np.zeros(300) # as word vectors are of zero length
    tf_idf_weight =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    cv_tfidf_w2v_title.append(vector)

print(len(cv_tfidf_w2v_title))
print(len(cv_tfidf_w2v_title[0]))

test_tfidf_w2v_title = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(X_test['project_title']): # for each review/sentence
    vector = np.zeros(300) # as word vectors are of zero length
    tf_idf_weight =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    test_tfidf_w2v_title.append(vector)

print(len(test_tfidf_w2v_title))
print(len(test_tfidf_w2v_title[0]))

# Changing list to numpy arrays
train_tfidf_w2v_title = np.array(train_tfidf_w2v_title)
test_tfidf_w2v_title = np.array(test_tfidf_w2v_title)
cv_tfidf_w2v_title = np.array(cv_tfidf_w2v_title)
100%|██████████| 49039/49039 [00:04<00:00, 10393.95it/s]
49039
300
100%|██████████| 24155/24155 [00:02<00:00, 10236.94it/s]
24155
300
100%|██████████| 36051/36051 [00:03<00:00, 9253.39it/s] 
36051
300
In [56]:
tfidf_model_res_sum = TfidfVectorizer()
tfidf_model_res_sum.fit(X_train['project_resource_summary'].values)
Out[56]:
TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
In [57]:
# we are converting a dictionary with word as a key, and the idf as a value
dictionary = dict(zip(tfidf_model_res_sum.get_feature_names(), list(tfidf_model_res_sum.idf_)))
tfidf_words = set(tfidf_model_res_sum.get_feature_names())

#ON PREPROCESSED TITLE
train_tfidf_w2v_resum = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(X_train['project_resource_summary']): # for each review/sentence
    vector = np.zeros(300) # as word vectors are of zero length
    tf_idf_weight =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    train_tfidf_w2v_resum.append(vector)

print(len(train_tfidf_w2v_resum))
print(len(train_tfidf_w2v_resum[0]))

cv_tfidf_w2v_resum = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(X_cv['project_resource_summary']): # for each review/sentence
    vector = np.zeros(300) # as word vectors are of zero length
    tf_idf_weight =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    cv_tfidf_w2v_resum.append(vector)

print(len(cv_tfidf_w2v_resum))
print(len(cv_tfidf_w2v_resum[0]))

test_tfidf_w2v_resum = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(X_test['project_resource_summary']): # for each review/sentence
    vector = np.zeros(300) # as word vectors are of zero length
    tf_idf_weight =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    test_tfidf_w2v_resum.append(vector)

print(len(test_tfidf_w2v_resum))
print(len(test_tfidf_w2v_resum[0]))

# Changing list to numpy arrays
train_tfidf_w2v_resum = np.array(train_tfidf_w2v_resum)
test_tfidf_w2v_resum = np.array(test_tfidf_w2v_resum)
cv_tfidf_w2v_resum = np.array(cv_tfidf_w2v_resum)
100%|██████████| 49039/49039 [00:18<00:00, 2582.62it/s]
49039
300
100%|██████████| 24155/24155 [00:08<00:00, 2771.28it/s]
24155
300
100%|██████████| 36051/36051 [00:12<00:00, 2994.62it/s]
36051
300

4. One Hot Encoding of Categorical features.

4.1 One Hot Encoding of STATE.

In [58]:
vectorizer = CountVectorizer()
vectorizer.fit(X_train['school_state'].values) # fit has to happen only on train data

# we use the fitted CountVectorizer to convert the text to vector
X_train_state_ohe = vectorizer.transform(X_train['school_state'].values)
X_cv_state_ohe = vectorizer.transform(X_cv['school_state'].values)
X_test_state_ohe = vectorizer.transform(X_test['school_state'].values)

print("After vectorizations")
print(X_train_state_ohe.shape, y_train.shape)
print(X_cv_state_ohe.shape, y_cv.shape)
print(X_test_state_ohe.shape, y_test.shape)
print(vectorizer.get_feature_names())
print("="*100)
After vectorizations
(49039, 51) (49039,)
(24155, 51) (24155,)
(36051, 51) (36051,)
['ak', 'al', 'ar', 'az', 'ca', 'co', 'ct', 'dc', 'de', 'fl', 'ga', 'hi', 'ia', 'id', 'il', 'in', 'ks', 'ky', 'la', 'ma', 'md', 'me', 'mi', 'mn', 'mo', 'ms', 'mt', 'nc', 'nd', 'ne', 'nh', 'nj', 'nm', 'nv', 'ny', 'oh', 'ok', 'or', 'pa', 'ri', 'sc', 'sd', 'tn', 'tx', 'ut', 'va', 'vt', 'wa', 'wi', 'wv', 'wy']
====================================================================================================

4.2 One Hot Encoding of TEACHER-PREFIX.

In [59]:
vectorizer = CountVectorizer()
vectorizer.fit(X_train['teacher_prefix'].values) # fit has to happen only on train data

# we use the fitted CountVectorizer to convert the text to vector
X_train_teacher_ohe = vectorizer.transform(X_train['teacher_prefix'].values)
X_cv_teacher_ohe = vectorizer.transform(X_cv['teacher_prefix'].values)
X_test_teacher_ohe = vectorizer.transform(X_test['teacher_prefix'].values)

print("After vectorizations")
print(X_train_teacher_ohe.shape, y_train.shape)
print(X_cv_teacher_ohe.shape, y_cv.shape)
print(X_test_teacher_ohe.shape, y_test.shape)
print(vectorizer.get_feature_names())
print("="*100)
After vectorizations
(49039, 5) (49039,)
(24155, 5) (24155,)
(36051, 5) (36051,)
['dr', 'mr', 'mrs', 'ms', 'teacher']
====================================================================================================

4.3 One Hot Encoding of PROJECT_GRADE_CATEGORY.

In [60]:
from collections import Counter
Counter = Counter()
for word in X_train['project_grade_category'].values:
    Counter.update(word.split())

# dict sort by value python: https://stackoverflow.com/a/613218/4084039
project_grade_category_dict = dict(Counter)
sorted_project_grade_category_dict = dict(sorted(project_grade_category_dict.items(), key=lambda kv: kv[1]))
In [61]:
vectorizer = CountVectorizer(vocabulary=list(sorted_project_grade_category_dict.keys()), lowercase=False, binary=True)
vectorizer.fit(X_train['project_grade_category'].values) # fit has to happen only on train data

# we use the fitted CountVectorizer to convert the text to vector
X_train_grade_ohe = vectorizer.transform(X_train['project_grade_category'].values)
X_cv_grade_ohe = vectorizer.transform(X_cv['project_grade_category'].values)
X_test_grade_ohe = vectorizer.transform(X_test['project_grade_category'].values)

print("After vectorizations")
print(X_train_grade_ohe.shape, y_train.shape)
print(X_cv_grade_ohe.shape, y_cv.shape)
print(X_test_grade_ohe.shape, y_test.shape)
print(vectorizer.get_feature_names())
print("="*100)
After vectorizations
(49039, 4) (49039,)
(24155, 4) (24155,)
(36051, 4) (36051,)
['Grades-9-12', 'Grades-6-8', 'Grades-3-5', 'Grades-PreK-2']
====================================================================================================

4.4 One Hot Encoding of PROJECT SUBJECT CATEGORY

In [62]:
vectorizer = CountVectorizer()
vectorizer.fit(X_train['clean_categories'].values) # fit has to happen only on train data

# we use the fitted CountVectorizer to convert the text to vector
X_train_clean_cat_ohe = vectorizer.transform(X_train['clean_categories'].values)
X_cv_clean_cat_ohe = vectorizer.transform(X_cv['clean_categories'].values)
X_test_clean_cat_ohe = vectorizer.transform(X_test['clean_categories'].values)

print("After vectorizations")
print(X_train_clean_cat_ohe.shape, y_train.shape)
print(X_cv_clean_cat_ohe.shape, y_cv.shape)
print(X_test_clean_cat_ohe.shape, y_test.shape)
print(vectorizer.get_feature_names())
print("="*100)
After vectorizations
(49039, 9) (49039,)
(24155, 9) (24155,)
(36051, 9) (36051,)
['appliedlearning', 'care_hunger', 'health_sports', 'history_civics', 'literacy_language', 'math_science', 'music_arts', 'specialneeds', 'warmth']
====================================================================================================

4.5 One Hot Encoding of PROJECT SUBJECT SUB-CATEGORY

In [63]:
vectorizer = CountVectorizer()
vectorizer.fit(X_train['clean_subcategories'].values) # fit has to happen only on train data

# we use the fitted CountVectorizer to convert the text to vector
X_train_clean_subcat_ohe = vectorizer.transform(X_train['clean_subcategories'].values)
X_cv_clean_subcat_ohe = vectorizer.transform(X_cv['clean_subcategories'].values)
X_test_clean_subcat_ohe = vectorizer.transform(X_test['clean_subcategories'].values)

print("After vectorizations")
print(X_train_clean_subcat_ohe.shape, y_train.shape)
print(X_cv_clean_subcat_ohe.shape, y_cv.shape)
print(X_test_clean_subcat_ohe.shape, y_test.shape)
print(vectorizer.get_feature_names())
print("="*100)
After vectorizations
(49039, 30) (49039,)
(24155, 30) (24155,)
(36051, 30) (36051,)
['appliedsciences', 'care_hunger', 'charactereducation', 'civics_government', 'college_careerprep', 'communityservice', 'earlydevelopment', 'economics', 'environmentalscience', 'esl', 'extracurricular', 'financialliteracy', 'foreignlanguages', 'gym_fitness', 'health_lifescience', 'health_wellness', 'history_geography', 'literacy', 'literature_writing', 'mathematics', 'music', 'nutritioneducation', 'other', 'parentinvolvement', 'performingarts', 'socialsciences', 'specialneeds', 'teamsports', 'visualarts', 'warmth']
====================================================================================================

5. Standardizing numerical features

5.1 Standardizing TEACHER NO OF PREV. POSTED PROJ

In [64]:
import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import StandardScaler
standard_vec = StandardScaler()

standard_vec.fit(X_train['teacher_number_of_previously_posted_projects'].values.reshape(-1,1))

X_train_TNPPP_std = standard_vec.transform(X_train['teacher_number_of_previously_posted_projects'].values.reshape(-1,1))
X_cv_TNPP_std = standard_vec.transform(X_cv['teacher_number_of_previously_posted_projects'].values.reshape(-1,1))
X_test_TNPP_std = standard_vec.transform(X_test['teacher_number_of_previously_posted_projects'].values.reshape(-1,1))

print("After vectorizations")
print(X_train_TNPPP_std.shape, y_train.shape)
print(X_cv_TNPP_std.shape, y_cv.shape)
print(X_test_TNPP_std.shape, y_test.shape)
print("="*100)
After vectorizations
(49039, 1) (49039,)
(24155, 1) (24155,)
(36051, 1) (36051,)
====================================================================================================

5.2 Standardizing PRICE

In [65]:
from sklearn.preprocessing import StandardScaler
standard_vec = StandardScaler()

standard_vec.fit(X_train['price'].values.reshape(-1,1))

X_train_price_std = standard_vec.transform(X_train['price'].values.reshape(-1,1))
X_cv_price_std = standard_vec.transform(X_cv['price'].values.reshape(-1,1))
X_test_price_std = standard_vec.transform(X_test['price'].values.reshape(-1,1))

print("After vectorizations")
print(X_train_price_std.shape, y_train.shape)
print(X_cv_price_std.shape, y_cv.shape)
print(X_test_price_std.shape, y_test.shape)
print("="*100)
After vectorizations
(49039, 1) (49039,)
(24155, 1) (24155,)
(36051, 1) (36051,)
====================================================================================================

5.3 Standardizing QUANTITY

In [66]:
from sklearn.preprocessing import StandardScaler
standard_vec = StandardScaler()

standard_vec.fit(X_train['quantity'].values.reshape(-1,1))

X_train_quantity_std = standard_vec.transform(X_train['quantity'].values.reshape(-1,1))
X_cv_quantity_std = standard_vec.transform(X_cv['quantity'].values.reshape(-1,1))
X_test_quantity_std = standard_vec.transform(X_test['quantity'].values.reshape(-1,1))

print("After vectorizations")
print(X_train_quantity_std.shape, y_train.shape)
print(X_cv_quantity_std.shape, y_cv.shape)
print(X_test_quantity_std.shape, y_test.shape)
print("="*100)
After vectorizations
(49039, 1) (49039,)
(24155, 1) (24155,)
(36051, 1) (36051,)
====================================================================================================

5.4 Standardizing TOTALWORDS_TITLE for SET-5

In [67]:
#Standardizing totalwords as its a numerical feature.
import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import StandardScaler
standard_vec = StandardScaler()

standard_vec.fit(X_train['totalwords_title'].values.reshape(-1,1))

X_train_titlecount_std = standard_vec.transform(X_train['totalwords_title'].values.reshape(-1,1))
X_cv_titlecount_std = standard_vec.transform(X_cv['totalwords_title'].values.reshape(-1,1))
X_test_titlecount_std = standard_vec.transform(X_test['totalwords_title'].values.reshape(-1,1))

print("After vectorizations")
print(X_train_titlecount_std.shape, y_train.shape)
print(X_cv_titlecount_std.shape, y_cv.shape)
print(X_test_titlecount_std.shape, y_test.shape)
print("="*100)
After vectorizations
(49039, 1) (49039,)
(24155, 1) (24155,)
(36051, 1) (36051,)
====================================================================================================

5.5 Standardizing TOTALWORDS_ESSAY for SET-5

In [68]:
#Standardizing totalwords as its a numerical feature.
import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import StandardScaler
standard_vec = StandardScaler()

standard_vec.fit(X_train['totalwords_essay'].values.reshape(-1,1))

X_train_essaycount_std = standard_vec.transform(X_train['totalwords_essay'].values.reshape(-1,1))
X_cv_essaycount_std = standard_vec.transform(X_cv['totalwords_essay'].values.reshape(-1,1))
X_test_essaycount_std = standard_vec.transform(X_test['totalwords_essay'].values.reshape(-1,1))

print("After vectorizations")
print(X_train_essaycount_std.shape, y_train.shape)
print(X_cv_essaycount_std.shape, y_cv.shape)
print(X_test_essaycount_std.shape, y_test.shape)
print("="*100)
After vectorizations
(49039, 1) (49039,)
(24155, 1) (24155,)
(36051, 1) (36051,)
====================================================================================================

NOTE: We don't need to Standardize the Sentiment scores as they are probabilities between 0 and 1. Therefore I will just convert those column values to an array.

In [69]:
#Compound Score Array
#You can either do reshape(-1, 1) or <variable_name>.T to transpose your array. I have chosen reshape(-1, 1)

X_train_compsc = np.array([X_train['Compound Score']]).reshape(-1, 1)
X_cv_compsc = np.array([X_cv['Compound Score']]).reshape(-1, 1)
X_test_compsc = np.array([X_test['Compound Score']]).reshape(-1, 1)
In [70]:
#Neutral Score Array
X_train_neusc = np.array([X_train['Neutral Score']]).reshape(-1, 1)
X_cv_neusc = np.array([X_cv['Neutral Score']]).reshape(-1, 1)
X_test_neusc = np.array([X_test['Neutral Score']]).reshape(-1, 1)
In [71]:
#Negative Score Array
X_train_negsc = np.array([X_train['Negative Score']]).reshape(-1, 1)
X_cv_negsc = np.array([X_cv['Negative Score']]).reshape(-1, 1)
X_test_negsc = np.array([X_test['Negative Score']]).reshape(-1, 1)
In [72]:
#Positive Score Array
X_train_possc = np.array([X_train['Positive Score']]).reshape(-1, 1)
X_cv_possc = np.array([X_cv['Positive Score']]).reshape(-1, 1)
X_test_possc = np.array([X_test['Positive Score']]).reshape(-1, 1)

6. Concatinating All Features.

6. 1 Concatinating for BoW

In [73]:
# merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
#Concatinating features for BoW
from scipy.sparse import hstack
X_tr_bow = hstack((X_train_essay_bow, X_train_title_bow, X_train_resum_bow, X_train_state_ohe, X_train_teacher_ohe, X_train_grade_ohe, X_train_clean_cat_ohe, X_train_clean_subcat_ohe, X_train_TNPPP_std, X_train_price_std, X_train_quantity_std, X_train_titlecount_std, X_train_essaycount_std, X_train_compsc, X_train_neusc, X_train_negsc, X_train_possc)).tocsr()
X_cv_bow = hstack((X_cv_essay_bow, X_cv_title_bow, X_cv_resum_bow, X_cv_state_ohe, X_cv_teacher_ohe, X_cv_grade_ohe, X_cv_clean_cat_ohe, X_cv_clean_subcat_ohe, X_cv_TNPP_std, X_cv_price_std, X_cv_quantity_std, X_cv_titlecount_std, X_cv_essaycount_std, X_cv_compsc, X_cv_neusc, X_cv_negsc, X_cv_possc)).tocsr()
X_te_bow = hstack((X_test_essay_bow, X_test_title_bow, X_test_resum_bow, X_test_state_ohe, X_test_teacher_ohe, X_test_grade_ohe, X_test_clean_cat_ohe, X_test_clean_subcat_ohe, X_test_TNPP_std, X_test_price_std, X_test_quantity_std, X_test_titlecount_std, X_test_essaycount_std, X_test_compsc, X_test_neusc, X_test_negsc, X_test_possc)).tocsr()

print("Final Data matrix")
print(X_tr_bow.shape, y_train.shape)
print(X_cv_bow.shape, y_cv.shape)
print(X_te_bow.shape, y_test.shape)
print("="*100)
Final Data matrix
(49039, 18041) (49039,)
(24155, 18041) (24155,)
(36051, 18041) (36051,)
====================================================================================================

6. 2 Concatinating for TF-IDF

In [81]:
#Concatinating features for TFIDF
from scipy.sparse import hstack
X_tr_tfidf = hstack((X_train_essay_tfidf, X_train_title_tfidf, X_train_resum_tfidf, X_train_state_ohe, X_train_teacher_ohe, X_train_grade_ohe, X_train_clean_cat_ohe, X_train_clean_subcat_ohe, X_train_TNPPP_std, X_train_price_std, X_train_quantity_std, X_train_titlecount_std, X_train_essaycount_std, X_train_compsc, X_train_neusc, X_train_negsc, X_train_possc)).tocsr()
X_cv_tfidf = hstack((X_cv_essay_tfidf, X_cv_title_tfidf, X_cv_resum_tfidf, X_cv_state_ohe, X_cv_teacher_ohe, X_cv_grade_ohe, X_cv_clean_cat_ohe, X_cv_clean_subcat_ohe, X_cv_TNPP_std, X_cv_price_std, X_cv_quantity_std, X_cv_titlecount_std, X_cv_essaycount_std, X_cv_compsc, X_cv_neusc, X_cv_negsc, X_cv_possc)).tocsr()
X_te_tfidf = hstack((X_test_essay_tfidf, X_test_title_tfidf, X_test_resum_tfidf, X_test_state_ohe, X_test_teacher_ohe, X_test_grade_ohe, X_test_clean_cat_ohe, X_test_clean_subcat_ohe, X_test_TNPP_std, X_test_price_std, X_test_quantity_std, X_test_titlecount_std, X_test_essaycount_std, X_test_compsc, X_test_neusc, X_test_negsc, X_test_possc)).tocsr()

print("Final Data matrix")
print(X_tr_tfidf.shape, y_train.shape)
print(X_cv_tfidf.shape, y_cv.shape)
print(X_te_tfidf.shape, y_test.shape)
print("="*100)
Final Data matrix
(49039, 15910) (49039,)
(24155, 15910) (24155,)
(36051, 15910) (36051,)
====================================================================================================

6. 3 Concatinating for AVG W2V

In [92]:
#Concatinating features for AVG W2V
from scipy.sparse import hstack
X_tr_avgw2v = hstack((train_w2v_vectors_essays, w2v_train_vectors_titles, w2v_train_vectors_resum, X_train_state_ohe, X_train_teacher_ohe, X_train_grade_ohe, X_train_clean_cat_ohe, X_train_clean_subcat_ohe, X_train_TNPPP_std, X_train_price_std, X_train_quantity_std, X_train_titlecount_std, X_train_essaycount_std, X_train_compsc, X_train_neusc, X_train_negsc, X_train_possc)).tocsr()
X_cv_avgw2v = hstack((cv_w2v_vectors_essays, w2v_cv_vectors_titles, w2v_cv_vectors_resum, X_cv_state_ohe, X_cv_teacher_ohe, X_cv_grade_ohe, X_cv_clean_cat_ohe, X_cv_clean_subcat_ohe, X_cv_TNPP_std, X_cv_price_std, X_cv_quantity_std, X_cv_titlecount_std, X_cv_essaycount_std, X_cv_compsc, X_cv_neusc, X_cv_negsc, X_cv_possc)).tocsr()
X_te_avgw2v = hstack((test_w2v_vectors_essays, w2v_test_vectors_titles, w2v_test_vectors_resum, X_test_state_ohe, X_test_teacher_ohe, X_test_grade_ohe, X_test_clean_cat_ohe, X_test_clean_subcat_ohe, X_test_TNPP_std, X_test_price_std, X_test_quantity_std, X_test_titlecount_std, X_test_essaycount_std, X_test_compsc, X_test_neusc, X_test_negsc, X_test_possc)).tocsr()

print("Final Data matrix")
print(X_tr_avgw2v.shape, y_train.shape)
print(X_cv_avgw2v.shape, y_cv.shape)
print(X_te_avgw2v.shape, y_test.shape)
print("="*100)
Final Data matrix
(49039, 1008) (49039,)
(24155, 1008) (24155,)
(36051, 1008) (36051,)
====================================================================================================

6.4 Concatinating for TFIDF AVG W2V

In [93]:
#Concatinating features for TFIDF W2V
from scipy.sparse import hstack
X_tr_tfidfw2v = hstack((train_tfidf_w2v_essay, train_tfidf_w2v_title, train_tfidf_w2v_resum, X_train_state_ohe, X_train_teacher_ohe, X_train_grade_ohe, X_train_clean_cat_ohe, X_train_clean_subcat_ohe, X_train_TNPPP_std, X_train_price_std, X_train_quantity_std, X_train_titlecount_std, X_train_essaycount_std, X_train_compsc, X_train_neusc, X_train_negsc, X_train_possc)).tocsr()
X_cv_tfidfw2v = hstack((cv_tfidf_w2v_essay, cv_tfidf_w2v_title, cv_tfidf_w2v_resum, X_cv_state_ohe, X_cv_teacher_ohe, X_cv_grade_ohe, X_cv_clean_cat_ohe, X_cv_clean_subcat_ohe, X_cv_TNPP_std, X_cv_price_std, X_cv_quantity_std, X_cv_titlecount_std, X_cv_essaycount_std, X_cv_compsc, X_cv_neusc, X_cv_negsc, X_cv_possc)).tocsr()
X_te_tfidfw2v = hstack((test_tfidf_w2v_essay, test_tfidf_w2v_title, test_tfidf_w2v_resum, X_test_state_ohe, X_test_teacher_ohe, X_test_grade_ohe, X_test_clean_cat_ohe, X_test_clean_subcat_ohe, X_test_TNPP_std, X_test_price_std, X_test_quantity_std, X_test_titlecount_std, X_test_essaycount_std, X_test_compsc, X_test_neusc, X_test_negsc, X_test_possc)).tocsr()

print("Final Data matrix")
print(X_tr_tfidfw2v.shape, y_train.shape)
print(X_cv_tfidfw2v.shape, y_cv.shape)
print(X_te_tfidfw2v.shape, y_test.shape)
print("="*100)
Final Data matrix
(49039, 1008) (49039,)
(24155, 1008) (24155,)
(36051, 1008) (36051,)
====================================================================================================

6. 5 Concatinating features for SET-5 (Applying TruncatedSVD on TfidfVectorizer of essay text ONLY, project title and project resource summary excluded.)

In [119]:
#Concatinating features for TFIDF W2V
#As mentioned in the assignment I am excluding project_title and only using TfidfVectorizer of essay_text. 
#Also excluding project_resource summary.

#Converting d to d' using TruncatedSVD has been done in section 5.1.

#Concatinating features for TFIDF
from scipy.sparse import hstack
X_tr_set5 = hstack((X_train_trunsvd_tfidf, X_train_state_ohe, X_train_teacher_ohe, X_train_grade_ohe, X_train_clean_cat_ohe, X_train_clean_subcat_ohe, X_train_TNPPP_std, X_train_price_std, X_train_quantity_std, X_train_titlecount_std, X_train_essaycount_std, X_train_compsc, X_train_neusc, X_train_negsc, X_train_possc)).tocsr()
X_cv_set5 = hstack((X_cv_trunsvd_tfidf, X_cv_state_ohe, X_cv_teacher_ohe, X_cv_grade_ohe, X_cv_clean_cat_ohe, X_cv_clean_subcat_ohe, X_cv_TNPP_std, X_cv_price_std, X_cv_quantity_std, X_cv_titlecount_std, X_cv_essaycount_std, X_cv_compsc, X_cv_neusc, X_cv_negsc, X_cv_possc)).tocsr()
X_te_set5 = hstack((X_test_trunsvd_tfidf, X_test_state_ohe, X_test_teacher_ohe, X_test_grade_ohe, X_test_clean_cat_ohe, X_test_clean_subcat_ohe, X_test_TNPP_std, X_test_price_std, X_test_quantity_std, X_test_titlecount_std, X_test_essaycount_std, X_test_compsc, X_test_neusc, X_test_negsc, X_test_possc)).tocsr()

print("Final Data matrix")
print(X_tr_set5.shape, y_train.shape)
print(X_cv_set5.shape, y_cv.shape)
print(X_te_set5.shape, y_test.shape)
print("="*100)
Final Data matrix
(49039, 908) (49039,)
(24155, 908) (24155,)
(36051, 908) (36051,)
====================================================================================================

7. Applying SVM

Set-1: categorical, numerical features + project_title(BOW) + preprocessed_essay (BOW)

With L1 regularization (BOW)

In [74]:
#Citation:
#This code is copied from here: https://stackoverflow.com/a/48803361/4084039
#With L1 Regularization!!

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier

#when loss='hinge' we're performing Linear SVM. (No kernel is used)

model_bow = SGDClassifier(loss='hinge', penalty='l1', n_jobs = -1, class_weight='balanced')

param_grid = {
    
    'alpha': np.logspace(-4, 4, 9)
    
}

grid = GridSearchCV(model_bow, param_grid, cv=3, scoring='roc_auc')
                    
grid.fit(X_tr_bow, y_train)

alpha = np.logspace(-4, 4, 9)

train_auc = grid.cv_results_["mean_train_score"]
train_scores_std = grid.cv_results_["std_train_score"]
cv_auc = grid.cv_results_["mean_test_score"]
cv_scores_std = grid.cv_results_["std_test_score"]

plt.figure()
plt.title('Model')
plt.xlabel('Hyperparameter: Alpha')
plt.ylabel('AUC')

# plot train scores
plt.semilogx(alpha, train_auc, label='Train AUC', color='darkblue')

# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alpha,
                       train_auc - train_scores_std,
                       train_auc + train_scores_std,
                       alpha=0.2,
                       color='darkblue')

plt.semilogx(alpha, cv_auc, label='CV AUC', color='darkorange')

# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alpha,
                       cv_auc - cv_scores_std,
                       cv_auc + cv_scores_std,
                       alpha=0.2,
                       color='darkorange')

plt.scatter(alpha, train_auc, label='Train AUC points', color='darkblue')
plt.scatter(alpha, cv_auc, label='CV AUC points', color='darkorange')

#Citation for plotting the legend outside the plot
#url: https://matplotlib.org/users/legend_guide.html

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
In [75]:
print("Best parameters with L1 Regularization: ", grid.best_params_)
print('AUC with the best parameters: ', grid.best_score_)
Best parameters with L1 Regularization:  {'alpha': 0.0001}
AUC with the best parameters:  0.6509932175747055

With L2 Regularization (BoW)

In [76]:
#Citation:
#This code is copied from here: https://stackoverflow.com/a/48803361/4084039
#With L1 Regularization!!

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier

#when loss='hinge' we're performing Linear SVM. (No kernel is used)

model_bow_l2 = SGDClassifier(loss='hinge', penalty='l2', n_jobs = -1, class_weight='balanced')

param_grid = {
    
    'alpha': np.logspace(-4, 4, 9)
    
}

grid_bow_l2 = GridSearchCV(model_bow_l2, param_grid, cv=3, scoring='roc_auc')
                    
grid_bow_l2.fit(X_tr_bow, y_train)

alpha = np.logspace(-4, 4, 9)

train_auc = grid_bow_l2.cv_results_["mean_train_score"]
train_scores_std = grid_bow_l2.cv_results_["std_train_score"]
cv_auc = grid_bow_l2.cv_results_["mean_test_score"]
cv_scores_std = grid_bow_l2.cv_results_["std_test_score"]

plt.figure()
plt.title('Model')
plt.xlabel('Hyperparameter: Alpha')
plt.ylabel('AUC')

# plot train scores
plt.semilogx(alpha, train_auc, label='Train AUC', color='darkblue')

# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alpha,
                       train_auc - train_scores_std,
                       train_auc + train_scores_std,
                       alpha=0.2,
                       color='darkblue')

plt.semilogx(alpha, cv_auc, label='CV AUC', color='darkorange')

# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alpha,
                       cv_auc - cv_scores_std,
                       cv_auc + cv_scores_std,
                       alpha=0.2,
                       color='darkorange')

plt.scatter(alpha, train_auc, label='Train AUC points', color='darkblue')
plt.scatter(alpha, cv_auc, label='CV AUC points', color='darkorange')

#Citation for plotting the legend outside the plot
#url: https://matplotlib.org/users/legend_guide.html

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
In [77]:
print("Best parameters with L1 Regularization: ", grid_bow_l2.best_params_)
print('AUC with the best parameters: ', grid_bow_l2.best_score_)
Best parameters with L1 Regularization:  {'alpha': 0.1}
AUC with the best parameters:  0.7219594435374718

Analysis:

  • After comparing L1 and L2 regularization, we are getting better performance with Alpha = 10**-1 with penalty='l2'
  • L1 with alpha = 0.0001 gives AUC of 0.650 and L2 gives AUC of 0.721 with alpha = 0.1.
In [78]:
%%time

#Citation: plot roc auc curve
#url: https://stackabuse.com/understanding-roc-curves-with-python/

best_alpha = 0.1
best_penalty = 'l2'

from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier 
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot

final_model_bow = SGDClassifier(loss='hinge', alpha=best_alpha, penalty=best_penalty, class_weight='balanced', n_jobs = -1)

#Using CalibratedClassifierCV as SVM doesnt natively support probabilities
calibrated = CalibratedClassifierCV(final_model_bow, method='sigmoid', cv=5)
calibrated.fit(X_tr_bow, y_train)

def plot_roc_curve(test_fpr, test_tpr, train_fpr, train_tpr):  
    plt.plot(train_fpr, train_tpr, color='red', label='ROC for train')
    plt.plot(test_fpr, test_tpr, color='orange', label='ROC for test')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('C: Hyperparameter')
    plt.ylabel('AUC')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()
    
y_test_pred = calibrated.predict_proba(X_te_bow)[:, 1]

y_train_pred = calibrated.predict_proba(X_tr_bow)[:, 1]  

auc_train = roc_auc_score(y_train, y_train_pred)  
print('AUC of Train Data: %.2f' % auc_train)  

auc_test = roc_auc_score(y_test, y_test_pred)  
print('AUC of Test Data: %.2f' % auc_test)  

train_fpr, train_tpr, train_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, y_test_pred)  

plot_roc_curve(test_fpr, test_tpr, train_fpr, train_tpr)
AUC of Train Data: 0.76
AUC of Test Data: 0.72
Wall time: 3.55 s
In [79]:
# we are writing our own function for predict, with defined thresould
# we will pick a threshold that will give the least fpr
def predict(proba, threshould, fpr, tpr):
    
    t = threshould[np.argmax(tpr*(1-fpr))]
    
    # (tpr*(1-fpr)) will be maximum if your fpr is very low and tpr is very high
    
    print("the maximum value of tpr*(1-fpr)", max(tpr*(1-fpr)), "for threshold", np.round(t,3))
    predictions = []
    for i in proba:
        if i>=t:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions
In [80]:
from sklearn.metrics import confusion_matrix

print("Train confusion matrix")
cm_train = confusion_matrix(y_train, predict(y_train_pred, train_thresholds, train_fpr, train_fpr))

class_names = ['negative','positive']
sns.heatmap(cm_train, annot=True, fmt='d',cmap='viridis')
plt.ylabel('Predicted label',size=18)
plt.xlabel('True label',size=18)
plt.title("Train Confusion Matrix\n",size=24)
plt.show()

print("Test confusion matrix")
cm_test = confusion_matrix(y_test, predict(y_test_pred, test_thresholds, test_fpr, test_fpr))

class_names = ['negative','positive']
sns.heatmap(cm_test, annot=True, fmt='d',cmap='viridis')
plt.ylabel('Predicted label',size=18)
plt.xlabel('True label',size=18)
plt.title("Test Confusion Matrix\n",size=24)
plt.show()
Train confusion matrix
the maximum value of tpr*(1-fpr) 0.24999999595850028 for threshold 0.766
Test confusion matrix
the maximum value of tpr*(1-fpr) 0.25 for threshold 0.796

Set-2 categorical, numerical features + project_title(TFIDF)+ preprocessed_essay (TFIDF)

With L1 Regularization (TFIDF)

In [82]:
#Citation:
#This code is copied from here: https://stackoverflow.com/a/48803361/4084039
#With L1 Regularization!

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier

#when loss='hinge' we're performing Linear SVM. (No kernel is used)

model_tfidf = SGDClassifier(loss='hinge', penalty='l1', n_jobs = -1, class_weight='balanced')

param_grid = {
    
    'alpha': np.logspace(-4, 4, 9)
    
}

grid_tfidf_l1 = GridSearchCV(model_tfidf, param_grid, cv=3, scoring='roc_auc')
                    
grid_tfidf_l1.fit(X_tr_tfidf, y_train)

alpha = np.logspace(-4, 4, 9)

train_auc = grid_tfidf_l1.cv_results_["mean_train_score"]
train_scores_std = grid_tfidf_l1.cv_results_["std_train_score"]
cv_auc = grid_tfidf_l1.cv_results_["mean_test_score"]
cv_scores_std = grid_tfidf_l1.cv_results_["std_test_score"]

plt.figure()
plt.title('Model')
plt.xlabel('Hyperparameter: Alpha')
plt.ylabel('AUC')

# plot train scores
plt.semilogx(alpha, train_auc, label='Train AUC', color='darkblue')

# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alpha,
                       train_auc - train_scores_std,
                       train_auc + train_scores_std,
                       alpha=0.2,
                       color='darkblue')

plt.semilogx(alpha, cv_auc, label='CV AUC', color='darkorange')

# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alpha,
                       cv_auc - cv_scores_std,
                       cv_auc + cv_scores_std,
                       alpha=0.2,
                       color='darkorange')

plt.scatter(alpha, train_auc, label='Train AUC points', color='darkblue')
plt.scatter(alpha, cv_auc, label='CV AUC points', color='darkorange')

#Citation for plotting the legend outside the plot
#url: https://matplotlib.org/users/legend_guide.html

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
In [83]:
print("Best parameters with L1 Regularization: ", grid_tfidf_l1.best_params_)
print('AUC with the best parameters: ', grid_tfidf_l1.best_score_)
Best parameters with L1 Regularization:  {'alpha': 0.001}
AUC with the best parameters:  0.671216581799505

With L2 Regularization (TFIDF)

In [84]:
#Citation:
#This code is copied from here: https://stackoverflow.com/a/48803361/4084039
#With L2 Regularization!

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier

#when loss='hinge' we're performing Linear SVM. (No kernel is used)

model_tfidf_l2 = SGDClassifier(loss='hinge', penalty='l2', n_jobs = -1, class_weight='balanced')

param_grid = {
    
    'alpha': np.logspace(-4, 4, 9)
    
}

grid_tfidf_l2 = GridSearchCV(model_tfidf_l2, param_grid, cv=3, scoring='roc_auc')
                    
grid_tfidf_l2.fit(X_tr_tfidf, y_train)

alpha = np.logspace(-4, 4, 9)

train_auc = grid_tfidf_l2.cv_results_["mean_train_score"]
train_scores_std = grid_tfidf_l2.cv_results_["std_train_score"]
cv_auc = grid_tfidf_l2.cv_results_["mean_test_score"]
cv_scores_std = grid_tfidf_l2.cv_results_["std_test_score"]

plt.figure()
plt.title('Model')
plt.xlabel('Hyperparameter: Alpha')
plt.ylabel('AUC')

# plot train scores
plt.semilogx(alpha, train_auc, label='Train AUC', color='darkblue')

# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alpha,
                       train_auc - train_scores_std,
                       train_auc + train_scores_std,
                       alpha=0.2,
                       color='darkblue')

plt.semilogx(alpha, cv_auc, label='CV AUC', color='darkorange')

# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alpha,
                       cv_auc - cv_scores_std,
                       cv_auc + cv_scores_std,
                       alpha=0.2,
                       color='darkorange')

plt.scatter(alpha, train_auc, label='Train AUC points', color='darkblue')
plt.scatter(alpha, cv_auc, label='CV AUC points', color='darkorange')

#Citation for plotting the legend outside the plot
#url: https://matplotlib.org/users/legend_guide.html

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
In [85]:
print("Best parameters with L2 Regularization: ", grid_tfidf_l2.best_params_)
print('AUC with the best parameters: ', grid_tfidf_l2.best_score_)
Best parameters with L2 Regularization:  {'alpha': 0.01}
AUC with the best parameters:  0.6954277785914402

Analysis for TFIDF:

  • After comparing L1 and L2 regularization, we are getting better performance with Alpha = 10**-2 with penalty='l2'
  • L1 with alpha = 0.001 gives AUC of 0.67 and L2 gives AUC of 0.695 with alpha = 0.01.
In [88]:
%%time

#Citation: plot roc auc curve
#url: https://stackabuse.com/understanding-roc-curves-with-python/

best_alpha = 0.01
best_penalty = 'l2'

from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier 
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot

model_tfidf_final = SGDClassifier(loss='hinge', alpha=best_alpha, penalty=best_penalty, class_weight='balanced', n_jobs = -1)


#Using CalibratedClassifierCV as SVM doesnt natively support probabilities
calibrated_tfidf = CalibratedClassifierCV(model_tfidf_final, method='sigmoid', cv=5)
calibrated_tfidf.fit(X_tr_tfidf, y_train)

def plot_roc_curve(test_fpr, test_tpr, train_fpr, train_tpr):  
    plt.plot(train_fpr, train_tpr, color='red', label='ROC for train')
    plt.plot(test_fpr, test_tpr, color='orange', label='ROC for test')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('C: Hyperparameter')
    plt.ylabel('AUC')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()
    
y_test_pred = calibrated_tfidf.predict_proba(X_te_tfidf)[:, 1] 

y_train_pred = calibrated_tfidf.predict_proba(X_tr_tfidf)[:, 1]   

auc_train = roc_auc_score(y_train, y_train_pred)  
print('AUC of Train Data: %.2f' % auc_train)  

auc_test = roc_auc_score(y_test, y_test_pred)  
print('AUC of Test Data: %.2f' % auc_test)  

train_fpr, train_tpr, train_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, y_test_pred)  

plot_roc_curve(test_fpr, test_tpr, train_fpr, train_tpr)
AUC of Train Data: 0.73
AUC of Test Data: 0.70
Wall time: 3.03 s
In [89]:
# we are writing our own function for predict, with defined thresould
# we will pick a threshold that will give the least fpr
def predict(proba, threshould, fpr, tpr):
    
    t = threshould[np.argmax(tpr*(1-fpr))]
    
    # (tpr*(1-fpr)) will be maximum if your fpr is very low and tpr is very high
    
    print("the maximum value of tpr*(1-fpr)", max(tpr*(1-fpr)), "for threshold", np.round(t,3))
    predictions = []
    for i in proba:
        if i>=t:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions
In [90]:
from sklearn.metrics import confusion_matrix

print("Train confusion matrix")
cm_train = confusion_matrix(y_train, predict(y_train_pred, train_thresholds, train_fpr, train_fpr))

class_names = ['negative','positive']
sns.heatmap(cm_train, annot=True, fmt='d',cmap='viridis')
plt.ylabel('Predicted label',size=18)
plt.xlabel('True label',size=18)
plt.title("Train Confusion Matrix\n",size=24)
plt.show()

print("Test confusion matrix")
cm_test = confusion_matrix(y_test, predict(y_test_pred, test_thresholds, test_fpr, test_fpr))

class_names = ['negative','positive']
sns.heatmap(cm_test, annot=True, fmt='d',cmap='viridis')
plt.ylabel('Predicted label',size=18)
plt.xlabel('True label',size=18)
plt.title("Test Confusion Matrix\n",size=24)
plt.show()
Train confusion matrix
the maximum value of tpr*(1-fpr) 0.24999999595850028 for threshold 0.801
Test confusion matrix
the maximum value of tpr*(1-fpr) 0.25 for threshold 0.815

Set 3: categorical, numerical features + project_title(AVG W2V)+ preprocessed_essay (AVG W2V)

With L1 Regularization (AVG W2V)

In [94]:
#Citation:
#This code is copied from here: https://stackoverflow.com/a/48803361/4084039
#With L1 Regularization!!

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier

#when loss='hinge' we're performing Linear SVM. (No kernel is used)

model_avgw2v = SGDClassifier(loss='hinge', penalty='l1', n_jobs = -1, class_weight='balanced')

param_grid = {
    
    'alpha': np.logspace(-4, 4, 9)
    
}

grid_avgw2v_l1 = GridSearchCV(model_avgw2v, param_grid, cv=3, scoring='roc_auc')
                    
grid_avgw2v_l1.fit(X_tr_avgw2v, y_train)

alpha = np.logspace(-4, 4, 9)

train_auc = grid_avgw2v_l1.cv_results_["mean_train_score"]
train_scores_std = grid_avgw2v_l1.cv_results_["std_train_score"]
cv_auc = grid_avgw2v_l1.cv_results_["mean_test_score"]
cv_scores_std = grid_avgw2v_l1.cv_results_["std_test_score"]

plt.figure()
plt.title('Model')
plt.xlabel('Hyperparameter: Alpha')
plt.ylabel('AUC')

# plot train scores
plt.semilogx(alpha, train_auc, label='Train AUC', color='darkblue')

# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alpha,
                       train_auc - train_scores_std,
                       train_auc + train_scores_std,
                       alpha=0.2,
                       color='darkblue')

plt.semilogx(alpha, cv_auc, label='CV AUC', color='darkorange')

# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alpha,
                       cv_auc - cv_scores_std,
                       cv_auc + cv_scores_std,
                       alpha=0.2,
                       color='darkorange')

plt.scatter(alpha, train_auc, label='Train AUC points', color='darkblue')
plt.scatter(alpha, cv_auc, label='CV AUC points', color='darkorange')

#Citation for plotting the legend outside the plot
#url: https://matplotlib.org/users/legend_guide.html

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
In [95]:
print("Best parameters with L2 Regularization: ", grid_avgw2v_l1.best_params_)
print('AUC with the best parameters: ', grid_avgw2v_l1.best_score_)
Best parameters with L2 Regularization:  {'alpha': 0.001}
AUC with the best parameters:  0.6944171512419242

With L2 Regularization (AVG W2V)

In [96]:
#Citation:
#This code is copied from here: https://stackoverflow.com/a/48803361/4084039

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier

#when loss='hinge' we're performing Linear SVM. (No kernel is used)

model_avgw2v_l2 = SGDClassifier(loss='hinge', penalty='l2', n_jobs = -1, class_weight='balanced')

param_grid = {
    
    'alpha': np.logspace(-4, 4, 9)
    
}

grid_avgw2v_l2 = GridSearchCV(model_avgw2v_l2, param_grid, cv=3, scoring='roc_auc')
                    
grid_avgw2v_l2.fit(X_tr_avgw2v, y_train)

alpha = np.logspace(-4, 4, 9)

train_auc = grid_avgw2v_l2.cv_results_["mean_train_score"]
train_scores_std = grid_avgw2v_l2.cv_results_["std_train_score"]
cv_auc = grid_avgw2v_l2.cv_results_["mean_test_score"]
cv_scores_std = grid_avgw2v_l2.cv_results_["std_test_score"]

plt.figure()
plt.title('Model')
plt.xlabel('Hyperparameter: Alpha')
plt.ylabel('AUC')

# plot train scores
plt.semilogx(alpha, train_auc, label='Train AUC', color='darkblue')

# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alpha,
                       train_auc - train_scores_std,
                       train_auc + train_scores_std,
                       alpha=0.2,
                       color='darkblue')

plt.semilogx(alpha, cv_auc, label='CV AUC', color='darkorange')

# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alpha,
                       cv_auc - cv_scores_std,
                       cv_auc + cv_scores_std,
                       alpha=0.2,
                       color='darkorange')

plt.scatter(alpha, train_auc, label='Train AUC points', color='darkblue')
plt.scatter(alpha, cv_auc, label='CV AUC points', color='darkorange')

#Citation for plotting the legend outside the plot
#url: https://matplotlib.org/users/legend_guide.html

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
In [97]:
print("Best parameters with L2 Regularization: ", grid_avgw2v_l2.best_params_)
print('AUC with the best parameters: ', grid_avgw2v_l2.best_score_)
Best parameters with L2 Regularization:  {'alpha': 0.01}
AUC with the best parameters:  0.7016072095259643

Analysis for AVG W2V:

  • After comparing L1 and L2 regularization, we are getting better performance with Alpha = 10**-2 with penalty='l2'
  • L1 with alpha = 0.001 gives AUC of 0.694 and L2 gives AUC of 0.701 with alpha = 0.01.
In [101]:
%%time

#Citation: plot roc auc curve
#url: https://stackabuse.com/understanding-roc-curves-with-python/

best_alpha = 0.01
best_penalty = 'l2'

from sklearn.linear_model import SGDClassifier 
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot

model_avgw2v_final = SGDClassifier(loss='hinge', alpha=best_alpha, penalty=best_penalty, class_weight='balanced', n_jobs = -1)

#Using CalibratedClassifierCV as SVM doesnt natively support probabilities
calibrated_avgw2v = CalibratedClassifierCV(model_avgw2v_final, method='sigmoid', cv=5)
calibrated_avgw2v.fit(X_tr_avgw2v, y_train)

def plot_roc_curve(test_fpr, test_tpr, train_fpr, train_tpr):  
    plt.plot(train_fpr, train_tpr, color='red', label='ROC for train')
    plt.plot(test_fpr, test_tpr, color='orange', label='ROC for test')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('C: Hyperparameter')
    plt.ylabel('AUC')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()
    
y_test_pred = calibrated_avgw2v.predict_proba(X_te_avgw2v)[:, 1] 

y_train_pred = calibrated_avgw2v.predict_proba(X_tr_avgw2v)[:, 1]  

auc_train = roc_auc_score(y_train, y_train_pred)  
print('AUC of Train Data: %.2f' % auc_train)  

auc_test = roc_auc_score(y_test, y_test_pred)  
print('AUC of Test Data: %.2f' % auc_test)  

train_fpr, train_tpr, train_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, y_test_pred)  

plot_roc_curve(test_fpr, test_tpr, train_fpr, train_tpr)
AUC of Train Data: 0.72
AUC of Test Data: 0.70
Wall time: 11.8 s
In [102]:
# we are writing our own function for predict, with defined thresould
# we will pick a threshold that will give the least fpr
def predict(proba, threshould, fpr, tpr):
    
    t = threshould[np.argmax(tpr*(1-fpr))]
    
    # (tpr*(1-fpr)) will be maximum if your fpr is very low and tpr is very high
    
    print("the maximum value of tpr*(1-fpr)", max(tpr*(1-fpr)), "for threshold", np.round(t,3))
    predictions = []
    for i in proba:
        if i>=t:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions
In [103]:
from sklearn.metrics import confusion_matrix

print("Train confusion matrix")
cm_train = confusion_matrix(y_train, predict(y_train_pred, train_thresholds, train_fpr, train_fpr))

class_names = ['negative','positive']
sns.heatmap(cm_train, annot=True, fmt='d',cmap='viridis')
plt.ylabel('Predicted label',size=18)
plt.xlabel('True label',size=18)
plt.title("Train Confusion Matrix\n",size=24)
plt.show()

print("Test confusion matrix")
cm_test = confusion_matrix(y_test, predict(y_test_pred, test_thresholds, test_fpr, test_fpr))

class_names = ['negative','positive']
sns.heatmap(cm_test, annot=True, fmt='d',cmap='viridis')
plt.ylabel('Predicted label',size=18)
plt.xlabel('True label',size=18)
plt.title("Test Confusion Matrix\n",size=24)
plt.show()
Train confusion matrix
the maximum value of tpr*(1-fpr) 0.24999999595850025 for threshold 0.791
Test confusion matrix
the maximum value of tpr*(1-fpr) 0.25 for threshold 0.804

Set 4: categorical, numerical features + project_title(TFIDF AVG W2V)+ preprocessed_essay (TFIDF AVG W2V)

With L1 Regularization (TFIDF AVG W2V)

In [106]:
#Citation:
#This code is copied from here: https://stackoverflow.com/a/48803361/4084039

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier

#when loss='hinge' we're performing Linear SVM. (No kernel is used)

model_tfidfavgw2v = SGDClassifier(loss='hinge', penalty='l1', n_jobs = -1, class_weight='balanced')

param_grid = {
    
    'alpha': np.logspace(-4, 4, 9)
    
}

grid_tfidfavgw2v_l1 = GridSearchCV(model_tfidfavgw2v, param_grid, cv=5, scoring='roc_auc')
                    
grid_tfidfavgw2v_l1.fit(X_tr_tfidfw2v, y_train)

alpha = np.logspace(-4, 4, 9)

train_auc = grid_tfidfavgw2v_l1.cv_results_["mean_train_score"]
train_scores_std = grid_tfidfavgw2v_l1.cv_results_["std_train_score"]
cv_auc = grid_tfidfavgw2v_l1.cv_results_["mean_test_score"]
cv_scores_std = grid_tfidfavgw2v_l1.cv_results_["std_test_score"]

plt.figure()
plt.title('Model')
plt.xlabel('Hyperparameter: Alpha')
plt.ylabel('AUC')

# plot train scores
plt.semilogx(alpha, train_auc, label='Train AUC', color='darkblue')

# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alpha,
                       train_auc - train_scores_std,
                       train_auc + train_scores_std,
                       alpha=0.2,
                       color='darkblue')

plt.semilogx(alpha, cv_auc, label='CV AUC', color='darkorange')

# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alpha,
                       cv_auc - cv_scores_std,
                       cv_auc + cv_scores_std,
                       alpha=0.2,
                       color='darkorange')

plt.scatter(alpha, train_auc, label='Train AUC points', color='darkblue')
plt.scatter(alpha, cv_auc, label='CV AUC points', color='darkorange')

#Citation for plotting the legend outside the plot
#url: https://matplotlib.org/users/legend_guide.html

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
In [107]:
print("Best parameters with L1 Regularization: ", grid_tfidfavgw2v_l1.best_params_)
print('AUC with the best parameters: ', grid_tfidfavgw2v_l1.best_score_)
Best parameters with L1 Regularization:  {'alpha': 0.001}
AUC with the best parameters:  0.6951248984544203

With L2 Regularization (TFIDF W2V)

In [108]:
#Citation:
#This code is copied from here: https://stackoverflow.com/a/48803361/4084039

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier

#when loss='hinge' we're performing Linear SVM. (No kernel is used)

model_tfidfavgw2v_l2 = SGDClassifier(loss='hinge', penalty='l2', n_jobs = -1, class_weight='balanced')

param_grid = {
    
    'alpha': np.logspace(-4, 4, 9)
    
}

grid_tfidfavgw2v_l2 = GridSearchCV(model_tfidfavgw2v_l2, param_grid, cv=5, scoring='roc_auc')
                    
grid_tfidfavgw2v_l2.fit(X_tr_tfidfw2v, y_train)

alpha = np.logspace(-4, 4, 9)

train_auc = grid_tfidfavgw2v_l2.cv_results_["mean_train_score"]
train_scores_std = grid_tfidfavgw2v_l2.cv_results_["std_train_score"]
cv_auc = grid_tfidfavgw2v_l2.cv_results_["mean_test_score"]
cv_scores_std = grid_tfidfavgw2v_l2.cv_results_["std_test_score"]

plt.figure()
plt.title('Model')
plt.xlabel('Hyperparameter: Alpha')
plt.ylabel('AUC')

# plot train scores
plt.semilogx(alpha, train_auc, label='Train AUC', color='darkblue')

# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alpha,
                       train_auc - train_scores_std,
                       train_auc + train_scores_std,
                       alpha=0.2,
                       color='darkblue')

plt.semilogx(alpha, cv_auc, label='CV AUC', color='darkorange')

# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alpha,
                       cv_auc - cv_scores_std,
                       cv_auc + cv_scores_std,
                       alpha=0.2,
                       color='darkorange')

plt.scatter(alpha, train_auc, label='Train AUC points', color='darkblue')
plt.scatter(alpha, cv_auc, label='CV AUC points', color='darkorange')

#Citation for plotting the legend outside the plot
#url: https://matplotlib.org/users/legend_guide.html

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
In [109]:
print("Best parameters with L2 Regularization: ", grid_tfidfavgw2v_l2.best_params_)
print('AUC with the best parameters: ', grid_tfidfavgw2v_l2.best_score_)
Best parameters with L2 Regularization:  {'alpha': 0.01}
AUC with the best parameters:  0.7076810583906346

Analysis for TFIDF W2V:

  • After comparing L1 and L2 regularization, we are getting better performance with Alpha = 10**-2 with penalty='l2'
  • L1 with alpha = 0.001 gives AUC of 0.695 and L2 gives AUC of 0.707 with alpha = 0.01.
In [132]:
%%time

#Citation: plot roc auc curve
#url: https://stackabuse.com/understanding-roc-curves-with-python/

best_alpha = 0.01
best_penalty = 'l2'

from sklearn.linear_model import SGDClassifier 
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot

model_tfidfw2v_final = SGDClassifier(loss='hinge', alpha=best_alpha, penalty=best_penalty, class_weight='balanced', n_jobs = -1)

#Using CalibratedClassifierCV as SVM doesnt natively support probabilities
calibrated_tfidfavgw2v = CalibratedClassifierCV(model_avgw2v_final, method='sigmoid', cv=5)
calibrated_tfidfavgw2v.fit(X_tr_avgw2v, y_train)

def plot_roc_curve(test_fpr, test_tpr, train_fpr, train_tpr):  
    plt.plot(train_fpr, train_tpr, color='red', label='ROC for train')
    plt.plot(test_fpr, test_tpr, color='orange', label='ROC for test')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('C: Hyperparameter')
    plt.ylabel('AUC')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()
    
y_test_pred = calibrated_tfidfavgw2v.predict_proba(X_te_tfidfw2v)[:, 1] 

y_train_pred = calibrated_tfidfavgw2v.predict_proba(X_tr_tfidfw2v)[:, 1]  

auc_train = roc_auc_score(y_train, y_train_pred)  
print('AUC of Train Data: %.2f' % auc_train)  

auc_test = roc_auc_score(y_test, y_test_pred)  
print('AUC of Test Data: %.2f' % auc_test)  

train_fpr, train_tpr, train_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, y_test_pred)  

plot_roc_curve(test_fpr, test_tpr, train_fpr, train_tpr)
AUC of Train Data: 0.72
AUC of Test Data: 0.70
Wall time: 12.4 s
In [111]:
# we are writing our own function for predict, with defined thresould
# we will pick a threshold that will give the least fpr
def predict(proba, threshould, fpr, tpr):
    
    t = threshould[np.argmax(tpr*(1-fpr))]
    
    # (tpr*(1-fpr)) will be maximum if your fpr is very low and tpr is very high
    
    print("the maximum value of tpr*(1-fpr)", max(tpr*(1-fpr)), "for threshold", np.round(t,3))
    predictions = []
    for i in proba:
        if i>=t:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions
In [112]:
from sklearn.metrics import confusion_matrix

print("Train confusion matrix")
cm_train = confusion_matrix(y_train, predict(y_train_pred, train_thresholds, train_fpr, train_fpr))

class_names = ['negative','positive']
sns.heatmap(cm_train, annot=True, fmt='d',cmap='viridis')
plt.ylabel('Predicted label',size=18)
plt.xlabel('True label',size=18)
plt.title("Train Confusion Matrix\n",size=24)
plt.show()

print("Test confusion matrix")
cm_test = confusion_matrix(y_test, predict(y_test_pred, test_thresholds, test_fpr, test_fpr))

class_names = ['negative','positive']
sns.heatmap(cm_test, annot=True, fmt='d',cmap='viridis')
plt.ylabel('Predicted label',size=18)
plt.xlabel('True label',size=18)
plt.title("Test Confusion Matrix\n",size=24)
plt.show()
Train confusion matrix
the maximum value of tpr*(1-fpr) 0.24999999595850025 for threshold 0.778
Test confusion matrix
the maximum value of tpr*(1-fpr) 0.2499999629322599 for threshold 0.789

Set 5: Apply the Support Vector Machines on these features by finding the best hyper paramter.

Considering The Following Features:

  • school_state : categorical data
  • clean_categories : categorical data
  • clean_subcategories : categorical data
  • project_grade_category :categorical data
  • teacher_prefix : categorical data
  • quantity : numerical data
  • teacher_number_of_previously_posted_projects : numerical data
  • price : numerical data
  • sentiment score's of each of the essay : numerical data
  • number of words in the title : numerical data
  • number of words in the combine essays : numerical data
  • Apply TruncatedSVD on TfidfVectorizer of essay text, choose the TruncatedSVD on TfidfVectorizer of essay text : numerical data

7.5.1 Applying TruncatedSVD on TfidfVectorizer of essay text:

In [113]:
#importing TruncatedSVD
from sklearn import decomposition
TruncSVD = decomposition.TruncatedSVD()
In [114]:
print('Training Matrix Shape:', X_train_essay_tfidf.shape)
print('CV Matrix Shape:', X_cv_essay_tfidf.shape)
print('Test Matrix Shape:', X_test_essay_tfidf.shape)
Training Matrix Shape: (49039, 11928)
CV Matrix Shape: (24155, 11928)
Test Matrix Shape: (36051, 11928)
In [115]:
#Code for dimensionality reduction on training matrix.

#these are principle components and not features
TruncSVD.n_components = 1000

TSVD_train_data = TruncSVD.fit_transform(X_train_essay_tfidf)

#explained_variance gives us the var (lambda_i values) which we divide with summation of those lambda_i values.
percentage_var_explained_train = TruncSVD.explained_variance_ / np.sum(TruncSVD.explained_variance_)

#cumsum keeps adding the lambda_i values in the numerator / with summation of all lambda_i's
cum_var_explained = np.cumsum(percentage_var_explained_train)

# Plot the TruncatedSVD spectrum
plt.figure(1, figsize=(6, 4))

plt.clf()
plt.plot(cum_var_explained, linewidth=2)
plt.axis('tight')
plt.grid()
plt.xlabel('n_components')
plt.ylabel('Cumulative_explained_variance')
plt.show()

#If we take 800-dimensions, a little over 90% of variance is explained for training.
In [116]:
#Code for dimensionality reduction on CV matrix.

#these are principle components and not features
TruncSVD.n_components = 1000

TSVD_cv_data = TruncSVD.fit_transform(X_cv_essay_tfidf)

#explained_variance gives us the var (lambda_i values) which we divide with summation of those lambda_i values.
percentage_var_explained_cv = TruncSVD.explained_variance_ / np.sum(TruncSVD.explained_variance_)

#cumsum keeps adding the lambda_i values in the numerator / with summation of all lambda_i's
cum_var_explained_cv = np.cumsum(percentage_var_explained_cv)

# Plot the TruncatedSVD spectrum
plt.figure(1, figsize=(6, 4))

plt.clf()
plt.plot(cum_var_explained_cv, linewidth=2)
plt.axis('tight')
plt.grid()
plt.xlabel('n_components')
plt.ylabel('Cumulative_explained_variance')
plt.show()

#If we take 800-dimensions, a little over 90% of variance is explained for training
In [117]:
#Code for dimensionality reduction on test matrix.

#these are principle components and not features
TruncSVD.n_components = 1000

TSVD_test_data = TruncSVD.fit_transform(X_test_essay_tfidf)

#explained_variance gives us the var (lambda_i values) which we divide with summation of those lambda_i values.
percentage_var_explained_test = TruncSVD.explained_variance_ / np.sum(TruncSVD.explained_variance_)

#cumsum keeps adding the lambda_i values in the numerator / with summation of all lambda_i's
cum_var_explained_test = np.cumsum(percentage_var_explained_test)

# Plot the TruncatedSVD spectrum
plt.figure(1, figsize=(6, 4))

plt.clf()
plt.plot(cum_var_explained_test, linewidth=2)
plt.axis('tight')
plt.grid()
plt.xlabel('n_components')
plt.ylabel('Cumulative_explained_variance')
plt.show()

#If we take 800-dimensions, a little over 90% of variance is explained for training

Analysis:

  • From the above 3 plots it can be inferenced that they are similar.
  • Therefore if we take 800 dimensions, a little over 90% of variance will be retained.
  • Hence we will now apply TruncatedSVD on train, cv, test of TfidfVectorizer of essay text.
In [118]:
from sklearn.decomposition import TruncatedSVD

#n_components = 800 as shown in the plots above
Truncated_SVD_tfidf = TruncatedSVD(n_components=800, random_state=0)
Truncated_SVD_tfidf.fit(X_train_essay_tfidf) #fit only on train data

X_train_trunsvd_tfidf = Truncated_SVD_tfidf.transform(X_train_essay_tfidf)
X_cv_trunsvd_tfidf = Truncated_SVD_tfidf.transform(X_cv_essay_tfidf)
X_test_trunsvd_tfidf = Truncated_SVD_tfidf.transform(X_test_essay_tfidf)

print('Before Vectorization')
print(X_train_essay_tfidf.shape, y_train.shape)
print(X_cv_essay_tfidf.shape, y_cv.shape)
print(X_test_essay_tfidf.shape, y_test.shape)

print('*'*50)

print("After Vectorization")
print(X_train_trunsvd_tfidf.shape, y_train.shape)
print(X_cv_trunsvd_tfidf.shape, y_cv.shape)
print(X_test_trunsvd_tfidf.shape, y_test.shape)
Before Vectorization
(49039, 11928) (49039,)
(24155, 11928) (24155,)
(36051, 11928) (36051,)
**************************************************
After Vectorization
(49039, 800) (49039,)
(24155, 800) (24155,)
(36051, 800) (36051,)

7.5.2. Applying SVM after TruncatedSVD on essay_text

With L1 Regularizer

Concatenation of all features has been done in section 6.5

In [120]:
#Citation:
#This code is copied from here: https://stackoverflow.com/a/48803361/4084039

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier

#when loss='hinge' we're performing Linear SVM. (No kernel is used)

model_tfidf_trunsvd = SGDClassifier(loss='hinge', penalty='l1', n_jobs = -1, class_weight='balanced')

param_grid = {
    
    'alpha': np.logspace(-4, 4, 9)
    
}

grid_tfidf_trunsvd_l1 = GridSearchCV(model_tfidf_trunsvd, param_grid, cv=3, scoring='roc_auc')
                    
grid_tfidf_trunsvd_l1.fit(X_tr_set5, y_train)

alpha = np.logspace(-4, 4, 9)

train_auc = grid_tfidf_trunsvd_l1.cv_results_["mean_train_score"]
train_scores_std = grid_tfidf_trunsvd_l1.cv_results_["std_train_score"]
cv_auc = grid_tfidf_trunsvd_l1.cv_results_["mean_test_score"]
cv_scores_std = grid_tfidf_trunsvd_l1.cv_results_["std_test_score"]

plt.figure()
plt.title('Model')
plt.xlabel('Hyperparameter: Alpha')
plt.ylabel('AUC')

# plot train scores
plt.semilogx(alpha, train_auc, label='Train AUC', color='darkblue')

# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alpha,
                       train_auc - train_scores_std,
                       train_auc + train_scores_std,
                       alpha=0.2,
                       color='darkblue')

plt.semilogx(alpha, cv_auc, label='CV AUC', color='darkorange')

# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alpha,
                       cv_auc - cv_scores_std,
                       cv_auc + cv_scores_std,
                       alpha=0.2,
                       color='darkorange')

plt.scatter(alpha, train_auc, label='Train AUC points', color='darkblue')
plt.scatter(alpha, cv_auc, label='CV AUC points', color='darkorange')

#Citation for plotting the legend outside the plot
#url: https://matplotlib.org/users/legend_guide.html

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
In [121]:
print("Best parameters with L1 Regularization: ", grid_tfidf_trunsvd_l1.best_params_)
print('AUC with the best parameters: ', grid_tfidf_trunsvd_l1.best_score_)
Best parameters with L1 Regularization:  {'alpha': 0.0001}
AUC with the best parameters:  0.6946643944947872

With L2 Regularizer

In [122]:
#Citation:
#This code is copied from here: https://stackoverflow.com/a/48803361/4084039
#With L1 Regularization!!

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier

#when loss='hinge' we're performing Linear SVM. (No kernel is used)

model_tfidf_trunsvd_l2 = SGDClassifier(loss='hinge', penalty='l2', n_jobs = -1, class_weight='balanced')

param_grid = {
    
    'alpha': np.logspace(-4, 4, 9)
    
}

grid_tfidf_trunsvd_l2 = GridSearchCV(model_tfidf_trunsvd_l2, param_grid, cv=3, scoring='roc_auc')
                    
grid_tfidf_trunsvd_l2.fit(X_tr_set5, y_train)

alpha = np.logspace(-4, 4, 9)

train_auc = grid_tfidf_trunsvd_l2.cv_results_["mean_train_score"]
train_scores_std = grid_tfidf_trunsvd_l2.cv_results_["std_train_score"]
cv_auc = grid_tfidf_trunsvd_l2.cv_results_["mean_test_score"]
cv_scores_std = grid_tfidf_trunsvd_l2.cv_results_["std_test_score"]

plt.figure()
plt.title('Model')
plt.xlabel('Hyperparameter: Alpha')
plt.ylabel('AUC')

# plot train scores
plt.semilogx(alpha, train_auc, label='Train AUC', color='darkblue')

# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alpha,
                       train_auc - train_scores_std,
                       train_auc + train_scores_std,
                       alpha=0.2,
                       color='darkblue')

plt.semilogx(alpha, cv_auc, label='CV AUC', color='darkorange')

# create a shaded area between [mean - std, mean + std]
plt.gca().fill_between(alpha,
                       cv_auc - cv_scores_std,
                       cv_auc + cv_scores_std,
                       alpha=0.2,
                       color='darkorange')

plt.scatter(alpha, train_auc, label='Train AUC points', color='darkblue')
plt.scatter(alpha, cv_auc, label='CV AUC points', color='darkorange')

#Citation for plotting the legend outside the plot
#url: https://matplotlib.org/users/legend_guide.html

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
In [123]:
print("Best parameters with L1 Regularization: ", grid_tfidf_trunsvd_l2.best_params_)
print('AUC with the best parameters: ', grid_tfidf_trunsvd_l2.best_score_)
Best parameters with L1 Regularization:  {'alpha': 0.001}
AUC with the best parameters:  0.693447247749568

Analysis:

  • After comparing L1 and L2 regularization, we are getting better performance with Alpha = 10**-4 with penalty='l1'
  • L1 with alpha = 0.0001 gives AUC of 0.694 and L2 gives AUC of 0.695 with alpha = 0.001.
In [127]:
%%time

#Citation: plot roc auc curve
#url: https://stackabuse.com/understanding-roc-curves-with-python/

best_alpha = 0.0001
best_penalty = 'l1'

from sklearn.linear_model import SGDClassifier 
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot

model_tfidf_trunsvd_final = SGDClassifier(loss='hinge', alpha=best_alpha, penalty=best_penalty, class_weight='balanced', n_jobs = -1)

#Using CalibratedClassifierCV as SVM doesnt natively support probabilities
calibrated_tfidf_set5 = CalibratedClassifierCV(model_tfidf_trunsvd_final, method='sigmoid', cv=5)
calibrated_tfidf_set5.fit(X_tr_set5, y_train)

def plot_roc_curve(test_fpr, test_tpr, train_fpr, train_tpr):  
    plt.plot(train_fpr, train_tpr, color='red', label='ROC for train')
    plt.plot(test_fpr, test_tpr, color='orange', label='ROC for test')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('C: Hyperparameter')
    plt.ylabel('AUC')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()
    
y_test_pred = calibrated_tfidf_set5.predict_proba(X_te_set5)[:, 1] 

y_train_pred = calibrated_tfidf_set5.predict_proba(X_tr_set5)[:, 1]   

auc_train = roc_auc_score(y_train, y_train_pred)  
print('AUC of Train Data: %.2f' % auc_train)  

auc_test = roc_auc_score(y_test, y_test_pred)  
print('AUC of Test Data: %.1f' % auc_test)  

train_fpr, train_tpr, train_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, y_test_pred)  

plot_roc_curve(test_fpr, test_tpr, train_fpr, train_tpr)
AUC of Train Data: 0.75
AUC of Test Data: 0.7
Wall time: 39.5 s
In [128]:
# we are writing our own function for predict, with defined thresould
# we will pick a threshold that will give the least fpr
def predict(proba, threshould, fpr, tpr):
    
    t = threshould[np.argmax(tpr*(1-fpr))]
    
    # (tpr*(1-fpr)) will be maximum if your fpr is very low and tpr is very high
    
    print("the maximum value of tpr*(1-fpr)", max(tpr*(1-fpr)), "for threshold", np.round(t,3))
    predictions = []
    for i in proba:
        if i>=t:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions
In [129]:
from sklearn.metrics import confusion_matrix

print("Train confusion matrix")
cm_train = confusion_matrix(y_train, predict(y_train_pred, train_thresholds, train_fpr, train_fpr))

class_names = ['negative','positive']
sns.heatmap(cm_train, annot=True, fmt='d',cmap='viridis')
plt.ylabel('Predicted label',size=18)
plt.xlabel('True label',size=18)
plt.title("Train Confusion Matrix\n",size=24)
plt.show()

print("Test confusion matrix")
cm_test = confusion_matrix(y_test, predict(y_test_pred, test_thresholds, test_fpr, test_fpr))

class_names = ['negative','positive']
sns.heatmap(cm_test, annot=True, fmt='d',cmap='viridis')
plt.ylabel('Predicted label',size=18)
plt.xlabel('True label',size=18)
plt.title("Test Confusion Matrix\n",size=24)
plt.show()
Train confusion matrix
the maximum value of tpr*(1-fpr) 0.24999999595850028 for threshold 0.787
Test confusion matrix
the maximum value of tpr*(1-fpr) 0.25 for threshold 0.807

8. Conclusion

In [130]:
#Citation: 
#url: http://zetcode.com/python/prettytable/

from prettytable import PrettyTable
    
x = PrettyTable()

x.field_names = ["Index", "Vectorizer", "Model->(Linear SVM)", "Hyper-parameter \u03B1", "Best Regularizer", "AUC"]

x.add_row(["Set 1", "Bag Of Words", "SGD with hinge loss", 0.1, 'L2', 0.72])
x.add_row(["Set 2", "TFIDF", "SGD with hinge loss", 0.01, 'L2', 0.70])
x.add_row(["Set 3", "AVG W2V", "SGD with hinge loss", 0.01, 'L2', 0.70])
x.add_row(["Set 4", "TFIDF", "SGD with hinge loss", 0.01, 'L2', 0.70])
x.add_row(["Set 5", "TFIDF on TruncatedSVD(Essay_Text)", "SGD with hinge loss", 0.0001, 'L1', 0.70])

print(x)
+-------+-----------------------------------+---------------------+-------------------+------------------+------+
| Index |             Vectorizer            | Model->(Linear SVM) | Hyper-parameter α | Best Regularizer | AUC  |
+-------+-----------------------------------+---------------------+-------------------+------------------+------+
| Set 1 |            Bag Of Words           | SGD with hinge loss |        0.1        |        L2        | 0.72 |
| Set 2 |               TFIDF               | SGD with hinge loss |        0.01       |        L2        | 0.7  |
| Set 3 |              AVG W2V              | SGD with hinge loss |        0.01       |        L2        | 0.7  |
| Set 4 |               TFIDF               | SGD with hinge loss |        0.01       |        L2        | 0.7  |
| Set 5 | TFIDF on TruncatedSVD(Essay_Text) | SGD with hinge loss |       0.0001      |        L1        | 0.7  |
+-------+-----------------------------------+---------------------+-------------------+------------------+------+